ADAPTIVE CONFIDENCE BANDS

By Christopher Genovese and Larry Wasserman 
Carnegie Mellon University 

We show that there do not exist adaptive confidence bands for
curve estimation except under very restrictive assumptions. We
propose instead to construct adaptive bands that cover a surrogate
function $f^*$ which is close to, but simpler than, $f$. The surrogate
captures the significant features in $f$. We establish lower bounds on
the width for any confidence band for $f^*$ and construct a procedure
that comes within a small constant factor of attaining the lower bound
for finite samples.



1. Introduction. 

1.1. Motivation. Let $(x_1, Y_1), \ldots, (x_n, Y_n)$ be observations from
the nonparametric regression model

(1) $Y_i = f(x_i) + \sigma\epsilon_i$

where $\epsilon_i \sim N(0,1)$, $x_i \in (0,1)$, and $f$ is assumed to lie in
some infinite-dimensional class of functions $\mathcal{H}$. We are interested
in constructing confidence bands $(L, U)$ for $f$. Ideally these bands should
satisfy

(2) $P_f\{L \le f \le U\} = 1 - \alpha$ for all $f \in \mathcal{H}$

where $L \le f \le U$ means that $L(x) \le f(x) \le U(x)$ for all
$x \in \mathcal{X}$, where $\mathcal{X}$ is some subset of $(0,1)$ such as
$\mathcal{X} = \{x\}$, $\mathcal{X} = \{x_1, \ldots, x_n\}$ or
$\mathcal{X} = (0,1)$. Throughout this paper, we take
$\mathcal{X} = \{x_1, \ldots, x_n\}$ but this particular choice is not
crucial in what follows.

Attaining (2) is difficult and hence it is common to settle for pointwise
asymptotic coverage:

(3) $\liminf_{n\to\infty} P_f\{L \le f \le U\} \ge 1 - \alpha$ for all $f \in \mathcal{H}$.

"Pointwise" refers to the fact that the asymptotic limit is taken for each
fixed $f$ rather than uniformly over $f \in \mathcal{H}$. Papers on pointwise
asymptotic methods include Claeskens and Van Keilegom (2003), Eubank and
Speckman (1993), Härdle and Marron (1991), Hall and Titterington (1988),
Härdle and Bowman (1988), Neumann and Polzehl (1998), and Xia (1998).

AMS 2000 subject classifications: 

Achieving even pointwise asymptotic coverage is nontrivial due to the
presence of bias. If $\hat f(x)$ is an estimator with mean $\bar f(x)$ and
standard deviation $s(x)$ then

$\frac{\hat f(x) - f(x)}{s(x)} = \frac{\hat f(x) - \bar f(x)}{s(x)} + \frac{\mathrm{bias}(x)}{\sqrt{\mathrm{variance}(x)}}.$

The first term typically satisfies a central limit theorem but the second term
does not vanish even asymptotically if the bias and variance are balanced.
For discussions on this point, see the papers referenced above as well as
Ruppert, Wand, and Carroll (2003) and Sun and Loader (1994).

Pointwise asymptotic bands are not uniform, that is, they do not control

(4) $\inf_{f \in \mathcal{H}} P_f\{L \le f \le U\}$.

The sample size $n(f)$ required for the true coverage to approximate the
nominal coverage depends on the unknown function $f$.

The aim of this paper is to attain uniform coverage over $\mathcal{H}$. We say
that $B = (L, U)$ has uniform coverage if

(5) $\inf_{f \in \mathcal{H}} P_f\{L \le f \le U\} \ge 1 - \alpha$.

Starting in Section 3 we will insist on coverage over $\mathcal{H}$ = {all functions}.

The bound in (5) can be achieved trivially using Bonferroni bands. Set
$\ell_i = Y_i - c_n\sigma$ and $u_i = Y_i + c_n\sigma$, where
$c_n = \Phi^{-1}(1 - \alpha/2n)$ and $\Phi$ is the standard Normal cdf. Yet
this band is unsatisfactory for several reasons:

1. The width of the band grows with sample size. 

2. The band is centered on a poor estimator of the unknown function. 

3. The width of the band is independent of the data and hence cannot 
adapt to the smoothness of the unknown function. 

Problems (1) and (2) are easily remedied by using standard smoothing meth- 
ods. But the results of Low (1997) suggest that (3) is an inevitable conse- 
quence of uniform coverage. 
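
As a rough numerical illustration of the first objection, the sketch below
computes the Bonferroni half-width $c_n\sigma$ for increasing $n$. It assumes
the cutoff is $c_n = \Phi^{-1}(1 - \alpha/2n)$ and that $\sigma$ is known; the
function names are ours and purely illustrative.

```python
from statistics import NormalDist


def bonferroni_half_width(n, alpha=0.05, sigma=1.0):
    """Half-width c_n * sigma, assuming c_n = Phi^{-1}(1 - alpha / (2n))."""
    c_n = NormalDist().inv_cdf(1.0 - alpha / (2.0 * n))
    return c_n * sigma


def bonferroni_band(y, alpha=0.05, sigma=1.0):
    """Band centered on the raw observations: (Y_i - c_n*sigma, Y_i + c_n*sigma)."""
    half = bonferroni_half_width(len(y), alpha, sigma)
    lower = [yi - half for yi in y]
    upper = [yi + half for yi in y]
    return lower, upper


if __name__ == "__main__":
    # The full width 2 * c_n * sigma increases slowly with n.
    for n in (10, 100, 1000, 10000):
        print(n, round(2 * bonferroni_half_width(n), 3))
```

The width depends only on $n$, $\alpha$ and $\sigma$, never on the observed
$Y_i$, which is the third objection in concrete form.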

The smoother the functions in $\mathcal{H}$, the smaller the width necessary
to achieve uniform coverage. Suppose that $\mathcal{F} \subset \mathcal{H}$
contains the "smooth" functions in $\mathcal{H}$ and that
$\mathcal{H} - \mathcal{F}$ is nonempty. Uniform coverage over $\mathcal{H}$
requires that the width of fixed-width bands be driven by the "rough"
functions in $\mathcal{H} - \mathcal{F}$; the width will thus be large even if
$f \in \mathcal{F}$. Ideally, our procedure would adjust automatically to
produce narrower bands when the function is smooth ($f \in \mathcal{F}$) and
wider bands when the function is rough ($f \notin \mathcal{F}$), but to
do that, the width must be determined from the data. Low showed that for
density estimation at a single point, fixed-width confidence intervals perform
as well as random length intervals; that is, the data do not help reduce the
width of the bands for smoother functions. In Section 2 we extend Low's
result to nonparametric regression and show that the phenomenon is quite
general. Without restrictive assumptions, confidence bands cannot adapt.

These results mean that the width of uniform confidence bands is
determined by the greatest roughness we are willing to assume. Because the
typical assumptions about $\mathcal{H}$ in the nonparametric regression
problem are loosely held and difficult to check, the result is that the
confidence band widths are essentially arbitrary. This is not satisfactory in
practice.

The contrast with $L^2$ confidence balls is noteworthy. $L^2$ confidence sets
have been studied by Li (1999), Juditsky and Lambert-Lacroix (2002), Beran
and Dümbgen (1998), Genovese and Wasserman (2004), Baraud (2004),
Hoffman and Lepski (2003), Cai and Low (2004), and Robins and van der
Vaart (2004). Let

(6) $\mathcal{B} = \left\{ f \in \mathbb{R}^n : \frac{1}{n}\sum_{i=1}^n (f_i - \hat f_i)^2 \le R_n^2 \right\}$

for some $\hat f$ and suppose that

(7) $\inf_{f \in \mathbb{R}^n} P_f\{f \in \mathcal{B}\} \ge 1 - \alpha$.

Then

(8) $\inf_{f} E_f(R_n) \ge \frac{C_1}{n^{1/4}}$, and $\sup_{f} E_f(R_n) \ge C_2$

where $C_1$ and $C_2$ are positive constants. Moreover, there exist confidence
sets that achieve the faster $n^{-1/4}$ rate at some points in $\mathbb{R}^n$.
Because fixed-radius confidence sets necessarily have radius of size $O(1)$,
the supremum in (8) implies such confidence sets must have random radii. We
can construct random-radius confidence balls that improve on fixed-radius
confidence sets, for example, by obtaining a smaller radius for subsets of
smoother functions $f$. $L^2$ confidence balls can therefore adapt to the
unknown smoothness of $f$. Unfortunately, confidence balls can be difficult
to work with in high dimensions (large $n$) and tend to constrain many
features of interest rather poorly, for which reasons confidence bands are
often desired.

It is also interesting to compare the adaptivity results for estimation and
inference. Estimators exist (e.g., Donoho et al. 1995) that can adapt to
unknown smoothness, achieving near optimal rates of convergence over a
broad scale of spaces. But since confidence bands cannot adapt, the
minimum width bands that achieve uniform coverage over the same scale of
spaces have width $O(1)$, overwhelming the differences among reasonable
estimators. We are left knowing that we are close to the true function but
being unable to demonstrate it inferentially.

The message we take from the nonadaptivity results in Low (1997) and
Section 2 of this paper is that the problem of constructing confidence bands
for $f$ over nonparametric classes is simply too difficult under the usual
definition of coverage. Instead, we introduce a slightly weaker notion -
surrogate coverage - under which it is possible to obtain adaptive bands
while allowing sharp inferences about the main features of $f$.

1.2. Surrogates. Figure 1 shows two situations where a band fails to
capture the true function. The top plot shows a conservative failure: the
only place where $f$ is not contained in the band is when the bands are
smoother than the truth. The bottom plot shows a liberal failure: the only
place where $f$ is not contained in the band is when the bands are less smooth
than the truth. The usual notion of coverage treats these failures equally.
Yet, in some sense, the second error is more serious than the first since the
bands overstate the complexity.

We are thus led to a different approach that treats conservative errors
and liberal errors differently. The basic idea is to find a function $f^*$
that is simpler than $f$ as in Figure 2. We then require that

(9) $P_f\{L \le f \le U \text{ or } L \le f^* \le U\} \ge 1 - \alpha$, for all functions $f$.

More generally, we will define a finite set of surrogates
$F^* \equiv F^*(f) = \{f, f_1^*, \ldots, f_m^*\}$ and require that a surrogate
confidence band $(L, U)$ satisfy

(10) $\inf_{f} P_f\{L \le g \le U \text{ for some } g \in F^*\} \ge 1 - \alpha$.

We will also consider bands that are adaptive in the following sense: if $f$
lies in some subspace $\mathcal{F}$ then with high probability
$\|U - L\|_\infty \le w(\mathcal{F})$, where $w(\mathcal{F})$ is the best
width of a uniformly valid confidence band (under the usual definition of
coverage) based on the a priori knowledge that $f \in \mathcal{F}$. Among
possible surrogates, a surrogate will be optimal if it admits a valid, adaptive
procedure and the set $\{f \in \mathcal{F} : F^*(f) = \{f\}\}$ is as large as possible.

1.3. Summary of Results. In Section 2 we show that Low's result on
density estimation holds in regression as well. Fixed width bands do as well
as random width bands, thus ruling out adaptivity. We show this when
$\mathcal{H}$ is the set of all functions and when $\mathcal{H}$ is a ball in
a Lipschitz, Sobolev, or Besov space.

Fig 1. The top plot shows a conservative failure: the only place where $f$ is
not contained in the band is when the bands are smoother than the truth. The
bottom plot shows a liberal failure: the only place where $f$ is not contained
in the band is when the bands are less smooth than the truth. The usual notion
of coverage treats these failures equally.

Section 3 gives our main results. Theorem 17 establishes lower bounds on
the width for any valid surrogate confidence band. Let $\mathcal{F}$ be a
subspace of dimension $d$ in $\mathbb{R}^n$. The functions that prevent
adaptation are those that are close to $\mathcal{F}$ in $L^2$ but far in
$L^\infty$. Loosely speaking, such functions are close to $\mathcal{F}$ except
for isolated, spiky features. If $\|f - \Pi f\|_2 \le \epsilon_2$ and
$\|f - \Pi f\|_\infty > \epsilon_\infty$ for tuning constants
$\epsilon_2, \epsilon_\infty$, define the surrogate $f^*$ to be the
projection of $f$ onto $\mathcal{F}$, $\Pi f$. Otherwise, define $f^* = f$.
We show that if $P_f\{\|U - L\|_\infty \le w\} \ge 1 - \gamma$ for all
$f \in \mathcal{F}$, then

(11) $w \ge \max\left(w_{\mathcal{F}}(\alpha, \gamma, \sigma),\ v(\epsilon_2, \epsilon_\infty, n, d, \alpha, \gamma, \sigma)\right)$,

where $w_{\mathcal{F}}$ is the minimum width for a uniform confidence band
knowing a priori that $f \in \mathcal{F}$ and
$v(\epsilon_2, \epsilon_\infty, n, d, \alpha, \gamma, \sigma)$ is described later.

Corollary 29 shows that for proper choice of $\epsilon_2$ and
$\epsilon_\infty$, the $v$ term in the previous equation can be made smaller
than $w_{\mathcal{F}}$. Figure 3 represents the functions involved; the gray
shaded area shows the functions that are replaced by surrogates in the
coverage statement, denoted later by $\mathcal{S}(\epsilon_2, \epsilon_\infty)$.
These are the functions that are both hard to distinguish from $\mathcal{F}$
(because they are close to it) and hard to cover (because they are "spiky").
The optimal choice of $\epsilon_2$ and $\epsilon_\infty$ minimizes the volume
of this set while making the right hand side in inequality (11) equal to
$w_{\mathcal{F}}$. Put another way, the richest model that permits adaptive
confidence bands under the usual notion of coverage is
$\mathbb{R}^n - \mathcal{S}(\epsilon_2, \epsilon_\infty)$.

Theorem 28 gives a procedure that comes within a factor of 2 of attaining
the lower bound for finite samples. The procedure conducts goodness of fit
tests for subspaces and constructs bands centered on the estimator of the
lowest dimensional nonrejected subspace. Such a procedure actually reflects
common practice. It is not uncommon to fit a model, check the fit, and, if the
model does not fit, to fit a more complex model. In this sense, we view
our results as providing a rigorous basis for common practice. It is known
that pretesting followed by inference does not lead to valid inferences for $f$
(Leeb and Pötscher, 2005). But if we can accept that sometimes we
cover a surrogate $f^*$ rather than $f$, then validity is restored.

These results are proved in Section 4.

1.4. Related Work. The idea of estimating the detectable part of / is 
present, at least implicitly, in other approaches. Davies and Kovac (2001) 
separate the data into a simple piece plus a noise piece which is similar 
in spirit to our approach. Another related idea is scale-space inference due
to Chaudhuri and Marron (2000) who focus on inference for all smoothed 
versions of / rather than / itself. Also related is the idea of oversmoothing 
as described in Terrell (1990) and Terrell and Scott (1985). Terrell argues 
that "By using the most smoothing that is compatible with the scale of the 
problem, we tend to eliminate accidental features." The idea of one-sided in- 
ference in Donoho (1988) has a similar spirit. Here, one constructs confidence 
intervals of the form [L, oo) for functionals such as the number of modes of 
a density. Bickel and Ritov (2000) make what they call a "radical proposal" 
to " ... determine how much bias can be tolerated without [interesting] fea- 
tures being obscured." We view our approach as a way of implementing their 
suggestion. Another related idea is contained in Donoho (1995) who showed
that if $\hat f$ is the soft threshold estimator of a function and
$f(x) = \sum_j \theta_j \psi_j(x)$ is an expansion in an unconditional basis,
then $P_f\{\hat f \preceq f\} \ge 1 - \alpha$ where
$\hat f = \sum_j \hat\theta_j \psi_j$ and $\hat f \preceq f$ means that
$|\hat\theta_j| \le |\theta_j|$ for all $j$. Finally, we remind
the reader that there is a plethora of work on adaptive estimation; see, for
example, Cai and Low (2004) and references therein.

1.5. Notation. If $L$ and $U$ are random functions on
$\mathcal{X} = \{x_1, \ldots, x_n\}$ such that $L \le U$, we define
$B = (L, U)$ to be the (random) set of all functions $g$ on $\mathcal{X}$ for
which $L \le g \le U$. We call $B$ (or equivalently, the pair $L, U$) a band;
the band covers a function $f$ if $f \in B$ (or equivalently, if
$L \le f \le U$). Define its width to be the random variable

(12) $W = \|U - L\|_\infty = \max_{1 \le i \le n} \left(U(x_i) - L(x_i)\right)$.

Because we are constructing bands on $\mathcal{X} = \{x_1, \ldots, x_n\}$, we
most often refer to functions in terms of their evaluations
$f = (f(x_1), \ldots, f(x_n)) \in \mathbb{R}^n$. When we need to refer to a
space of functions to which $f$ belongs, we use a ~ to denote the function
space and no ~ to denote the vector space of evaluations. Thus, if
$\tilde{\mathcal{A}}$ is the space of all functions, then
$\mathcal{A} = \mathbb{R}^n$. In both cases, we use the same symbol for the
function and let the meaning be clear from context; for example,
$f \in \tilde{\mathcal{A}}$ is the function and $f \in \mathcal{A}$ is the
vector $(f(x_1), \ldots, f(x_n))$. Define the following norms on $\mathbb{R}^n$:



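$\|f\| = \sqrt{\frac{1}{n}\sum_{i=1}^n f_i^2}$, $\|f\|_\infty = \max_i |f_i|$.

We use $\langle\cdot,\cdot\rangle$ to denote the inner product
$\langle f, g\rangle = \frac{1}{n}\sum_{i=1}^n f_i g_i$ corresponding to
$\|\cdot\|$.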



imsart-aos ver. 2006/10/13 file: genovese-wasserman07.tex date: February 2, 2008 






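If $\mathcal{F}$ is a subspace of $\mathbb{R}^n$, we define
$\Pi_{\mathcal{F}}$ to be the Euclidean projection onto $\mathcal{F}$, using
just $\Pi$ if the subspace is clear from context. We use

(13) $e_i = (\underbrace{0, \ldots, 0}_{i-1 \text{ times}}, 1, \underbrace{0, \ldots, 0}_{n-i \text{ times}})$

to denote the standard basis on $\mathbb{R}^n$.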

If $F_\theta$ is a family of CDFs indexed by $\theta$, we write
$F_\theta^{-1}(\alpha)$ to denote the lower-tail $\alpha$-quantile of
$F_\theta$. For the standard normal distribution, however, we use $z_\alpha$
to denote the upper-tail $\alpha$-quantile, and we denote the CDF and
PDF, respectively, by $\Phi$ and $\phi$.

Throughout the paper we assume that $\sigma$ is a known constant; in some
cases we simply set $\sigma = 1$. But see Remark 21 about the unknown
$\sigma$ case.

2. Nonadaptivity of Bands. In this section we construct lower bounds
on the width of valid confidence bands analogous to (8) and we show that
the lower bound is achieved by fixed-width bands.

Low (1997) considered estimating a density $f$ in the class

$\mathcal{F}(a, k, M) = \left\{ f : f \ge 0,\ \int f = 1,\ f(x_0) \le a,\ \|f^{(k)}(x)\|_\infty \le M \right\}$.

He shows that if $C_n$ is a confidence interval for $f(0)$, that is,

$\inf_{f \in \mathcal{F}(a,k,M)} P_f\{f(0) \in C_n\} \ge 1 - \alpha$,

then, for every $\epsilon > 0$, there exists $N = N(\epsilon, M)$ and $c > 0$
such that, for all $n \ge N$,

(14) $E_f(\mathrm{length}(C_n)) \ge c\, n^{-k/(2k+1)}$

for all $f \in \mathcal{F}(a,k,M)$ such that $f(0) > \epsilon$. Moreover,
there exists a fixed-width confidence interval $C_n$ and a constant $c_1$ such
that $E_f(\mathrm{length}(C_n)) \le c_1 n^{-k/(2k+1)}$ for all
$f \in \mathcal{F}(a, k, M)$. Thus, the data play no role in constructing a
rate-optimal band, except in determining the center of the interval.

For example, if we use kernel density estimation, we could construct an 
optimal bandwidth h = h(n, k) depending only on n and k - but not the 
data - and construct the interval from that kernel estimator. This makes 
the interval highly dependent on the minimal amount of smoothness k that 
is assumed. And it rules out the usual data-dependent bandwidth methods 
such as cross-validation. 
Now return to the regression model

(15) $Y_i = f_i + \sigma\epsilon_i$, $i = 1, \ldots, n$,

where $\epsilon_1, \ldots, \epsilon_n$ are independent, Normal(0, 1) random
variables, and $f = (f_1, \ldots, f_n) \in \mathbb{R}^n$.

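Theorem 1. Let $B = (L, U)$ be a $1 - \alpha$ confidence band over $\Theta$,
where $0 < \alpha < 1/2$ and let $g \in \Theta$. Suppose that $\Theta$
contains a finite set of vectors $\Omega$ such that:

1. for every distinct pair $f, \nu \in \Omega$ we have $\langle f - g, \nu - g\rangle = 0$ and

2. for some $0 < \epsilon < (1/2) - \alpha$,

(16) $\max_{f \in \Omega} \frac{e^{n\|f - g\|^2/\sigma^2}}{|\Omega|} \le \epsilon^2$.

Then,

(17) $E_g(W) \ge (1 - 2\alpha - 2\epsilon) \min_{f \in \Omega} \|g - f\|_\infty$.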

We begin with the case where the class is all of $\mathbb{R}^n$. We will
obtain a lower bound on the width of any confidence band and then show that a
fixed-width procedure attains that width. The results hinge on finding a
least favorable configuration of mean vectors that are as far away from each
other as possible in $L^\infty$ while staying a fixed distance $\epsilon$ in
total-variation distance.

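Theorem 2. Let $\mathcal{H} = \mathbb{R}^n$ and fix $0 < \alpha < 1/2$. Let
$B = (L, U)$ be a $1 - \alpha$ confidence band over $\mathcal{H}$. Then, for
every $0 < \epsilon < (1/2) - \alpha$,

(18) $\inf_{f \in \mathbb{R}^n} E_f(W) \ge (1 - 2\alpha - 2\epsilon)\,\sigma\sqrt{\log(n\epsilon^2)}$.

The bound is achieved (up to constants) by the fixed-width Bonferroni bands:
$\ell_i = Y_i - \sigma z_{\alpha/n}$, $u_i = Y_i + \sigma z_{\alpha/n}$.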
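
To make the least favorable configuration concrete, the sketch below evaluates
the single-spike construction $\Omega = \{g + a e_i : i = 1, \ldots, n\}$ with
$a = \sigma\sqrt{\log(n\epsilon^2)}$. It is only an illustration under our
reading of the garbled formulas: we take condition (16) to be
$\max_{f\in\Omega} e^{n\|f-g\|^2/\sigma^2}/|\Omega| \le \epsilon^2$, the norm to
be normalized by $1/n$, and $z_{\alpha/n}$ to be the upper-tail quantile. The
function names are hypothetical.

```python
import math
from statistics import NormalDist


def theorem2_lower_bound(n, alpha, eps, sigma=1.0):
    """(1 - 2*alpha - 2*eps) * sigma * sqrt(log(n * eps^2)), our reading of (18)."""
    return (1 - 2 * alpha - 2 * eps) * sigma * math.sqrt(math.log(n * eps ** 2))


def spike_condition(n, eps, sigma=1.0):
    """Left side of (16) for Omega = {g + a*e_i}, with a = sigma*sqrt(log(n eps^2)).

    Distinct spikes are orthogonal, and each has squared normalized norm a^2 / n.
    """
    a = sigma * math.sqrt(math.log(n * eps ** 2))
    sq_norm = a ** 2 / n
    return math.exp(n * sq_norm / sigma ** 2) / n  # |Omega| = n


def bonferroni_width(n, alpha, sigma=1.0):
    """Full width 2 * sigma * z_{alpha/n} of the fixed-width Bonferroni band."""
    return 2 * sigma * NormalDist().inv_cdf(1 - alpha / n)


if __name__ == "__main__":
    n, alpha, eps = 10_000, 0.05, 0.2
    print(spike_condition(n, eps), eps ** 2)   # the two agree by construction
    print(theorem2_lower_bound(n, alpha, eps), bonferroni_width(n, alpha))
```

Both the lower bound and the Bonferroni width grow like $\sqrt{\log n}$, which
is the sense in which the fixed-width band attains the bound up to constants.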

Theorem 3 (Lipschitz Balls). Define $x_i = i/n$ for $1 \le i \le n$. Let

(19) $\tilde{\mathcal{H}}(L) = \left\{ f : |f(x) - f(y)| \le L|x - y|,\ x, y \in [0, 1] \right\}$,

be a ball in Lipschitz space, and let

(20) $\mathcal{H}(L) = \left\{ (f(x_1), \ldots, f(x_n)) : f \in \tilde{\mathcal{H}}(L) \right\}$

be the vector of evaluations on $\mathcal{X}$. Fix $0 < \alpha < 1/2$ and let
$B = (L, U)$ be a $1 - \alpha$ confidence band over $\mathcal{H}(L)$. Then,
for every $0 < \epsilon < (1/2) - \alpha$,

(21) $\inf_{f \in \mathcal{H}(L)} E_f(W) \ge a_n$



where 



logny/ 3 fLa 2 \ 1/3 



n J \ 2 J 

L 3 log(l + e 2 ) 2 log(L/(2ff)) lo g (§ lo § n + lo s( 1 + € ^ + I lo g( L /( 2 ^)) 



log n log n log n 

The lower bound is achieved (up to logarithmic factors) by a fixed-width 
procedure. 

Theorem 4 (Sobolev Balls). Let $\mathcal{H}(p, c)$ be a Sobolev ball of order
$p$ and radius $c$ and let $B = (L, U)$ be a $1 - \alpha$ confidence band over
$\mathcal{H}(p, c)$. For every $0 < \epsilon < (1/2) - \alpha$, for every
$\delta > 0$, and all large $n$,

(22) inf E F (W) > (1 - 2a - 2e) ( 

for some c n that increases at most logarithmically. The bound is achieved 
(up to logarithmic factors) by a fixed-width band procedure. 

Theorem 5 (Besov Balls). Let $\mathcal{H}(p, q, \xi, c)$ be a ball of size $c$
in the Besov space $B^{\xi}_{p,q}$ and let $B = (L, U)$ be a $1 - \alpha$
confidence band over $\mathcal{H}(p, q, \xi, c)$. For every
$0 < \epsilon < (1/2) - \alpha$, and every $\delta > 0$,

(23) inf E f (W) > c(l -2a- 2e) n - 1 ^ 1 ^- 1 / 2 \ 

The bound is achieved (up to logarithmic factors) by a fixed-width procedure. 

3. Adaptive Bands. Let $\{\mathcal{F}_T : T \in \mathcal{T}\}$ be a scale of
linear subspaces. Let $w_T$ denote the smallest width of any confidence band
when it is known that $f \in \mathcal{F}_T$ (defined more precisely below). We
would like to define an appropriate surrogate and a procedure that gets as
close as possible to the target width $w_T$ when $f \in \mathcal{F}_T$. To
clarify the ideas, Subsection 3.2 develops our results in the special case
where the subspaces are $\{\mathcal{F}, \mathbb{R}^n\}$ for a fixed
$\mathcal{F}$ of dimension $d < n$. Subsection 3.3 handles the more general
case of a sequence of nested subspaces.
3.1. Preliminaries. We begin by defining several quantities that will be
used throughout. Let $\tau(\epsilon)$ denote the total variation distance
between a $N(0, 1)$ and a $N(\epsilon, 1)$ distribution. Thus,

(24) $\tau(\epsilon) = \Phi(\epsilon/2) - \Phi(-\epsilon/2)$.

Then, $\epsilon\phi(\epsilon/2) \le \tau(\epsilon) \le \epsilon\phi(0)$ and
$\tau(\epsilon) \sim \epsilon\phi(0)$ as $\epsilon \to 0$.
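
The function $\tau$ and its inverse recur throughout, so a small numerical
sketch may help. The code below evaluates $\tau$, the two bounds just stated,
and the inverse $\tau^{-1}(p) = 2\Phi^{-1}((1+p)/2)$, which follows from
$\tau(\epsilon) = 2\Phi(\epsilon/2) - 1$. The function names are ours.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard Normal: cdf is Phi, pdf is phi


def tau(eps):
    """Total variation distance between N(0, 1) and N(eps, 1)."""
    return _STD.cdf(eps / 2) - _STD.cdf(-eps / 2)


def tau_bounds(eps):
    """Return (eps * phi(eps / 2), eps * phi(0)), which sandwich tau(eps)."""
    return eps * _STD.pdf(eps / 2), eps * _STD.pdf(0.0)


def tau_inverse(p):
    """Solve tau(eps) = p for 0 < p < 1 using tau(eps) = 2 * Phi(eps / 2) - 1."""
    return 2 * _STD.inv_cdf((1 + p) / 2)


if __name__ == "__main__":
    for eps in (0.1, 1.0, 3.0):
        lower, upper = tau_bounds(eps)
        print(eps, lower, tau(eps), upper)
```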

Lemma 6. If $P = N(f, \sigma^2 I)$ and $Q = N(g, \sigma^2 I)$ are
multivariate Normals with $f, g \in \mathbb{R}^n$ then

(25) $d_{TV}(P, Q) = \tau\left(\frac{\sqrt{n}\,\|f - g\|}{\sigma}\right)$.

We will need several constants. For $0 < \alpha < 1$ and
$0 < \gamma < 1 - 2\alpha$ define

(26) $\kappa(\alpha, \gamma) = \left(2\log\left(1 + 4(1 - \gamma - 2\alpha)^2\right)\right)^{1/4}$.

For $0 < \beta < 1 - \xi < 1$ and integer $m \ge 1$ define
$Q = Q(m, \beta, \xi)$ to be the solution of

(27) $\xi = 1 - F_{0,m}\left(F^{-1}_{Q,m}(\beta)\right)$,

where $F_{a,d}$ denotes the CDF of a $\chi^2$ random variable with $d$ degrees
of freedom and noncentrality parameter $a$.

Lemma 7. There is a universal constant $\Lambda(\beta, \xi)$ such that
$Q(m, \beta, \xi) \le \Lambda(\beta, \xi)$ for all $m \ge 1$. For example,
$\Lambda(.05, .05) \le 6.25$. Suppose now that $m = m_n$, $\beta = \beta_n$,
and $\xi = \xi_n$ are all functions of $n$. As long as
$-\log\beta_n \le \log n$ and $-\log\xi_n \le \sqrt{\log n}$, then
$Q(m_n, \beta_n, \xi_n) = O(\sqrt{\log n})$.

Next, define

(28) $E(m, \alpha, \gamma) = \max\left(Q(m, \alpha, \gamma),\ 2\kappa(\alpha, \gamma)\right)$,

for $0 < \alpha < 1$ and $0 < \gamma < 1 - 2\alpha$.

Finally, if $\mathcal{F}$ is a subspace of dimension $d$, define

(29) $\Omega_{\mathcal{F}} = \max_{1 \le i \le n} \frac{\|\Pi_{\mathcal{F}} e_i\|}{\|e_i\|}$

where $e_i$ is defined in equation (13). Note that
$0 \le \Omega_{\mathcal{F}} \le 1$. The value of $\Omega_{\mathcal{F}}$
relates to the geometry of $\mathcal{F}$ as a hyperplane embedded in
$\mathbb{R}^n$, as seen through the following results.
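Lemma 8. Let $\mathcal{F}$ be a subspace of $\mathbb{R}^n$. Then

(30) $\min\left\{ \|v\| : v \in \mathcal{F},\ \|v\|_\infty = \epsilon \right\} = \frac{\epsilon}{\sqrt{n}\,\Omega_{\mathcal{F}}}$

(31) $\max\left\{ \|v\|_\infty : v \in \mathcal{F},\ \|v\| = \epsilon \right\} = \epsilon\sqrt{n}\,\Omega_{\mathcal{F}}$.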

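Lemma 9. Let $\{\phi_1, \ldots, \phi_d\}$ be orthonormal vectors with respect
to $\|\cdot\|$ in $\mathbb{R}^n$ and let $\mathcal{F}$ be the linear span of
these vectors. Then

(32) $\Omega_{\mathcal{F}} = \max_{1 \le i \le n} \sqrt{\frac{\sum_{j=1}^d \phi_j^2(i)}{n}}$.

In particular, if $\max_j \max_i |\phi_j(i)| \le c$ then

(33) $\Omega_{\mathcal{F}} \le c\sqrt{\frac{d}{n}}$.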
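
A small numerical check can make $\Omega_{\mathcal{F}}$ less abstract. The
sketch below takes $\mathcal{F}$ to be the vectors that are constant on $d$
equal consecutive blocks and computes $\Omega_{\mathcal{F}}$ directly from the
definition, assuming the norm $\|\cdot\|$ is the root mean square. All names
are illustrative, and the block structure is our choice of example.

```python
import math


def project_piecewise_constant(v, d):
    """Euclidean projection onto vectors constant on d equal consecutive blocks."""
    n = len(v)
    if n % d != 0:
        raise ValueError("this sketch assumes n is divisible by d")
    size = n // d
    out = []
    for b in range(d):
        block = v[b * size:(b + 1) * size]
        out.extend([sum(block) / size] * size)  # the block mean is the projection
    return out


def norm(v):
    """Normalized norm: square root of the mean of squares (assumed convention)."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def sup_norm(v):
    return max(abs(x) for x in v)


def omega(project, n):
    """max_i ||Pi e_i|| / ||e_i|| over the standard basis vectors e_i."""
    best = 0.0
    for i in range(n):
        e = [0.0] * n
        e[i] = 1.0
        best = max(best, norm(project(e)) / norm(e))
    return best


if __name__ == "__main__":
    n, d = 12, 3
    om = omega(lambda v: project_piecewise_constant(v, d), n)
    print(om, math.sqrt(d / n))
    v = [1.0] * 4 + [0.0] * 8   # a vector in the subspace
    print(sup_norm(v), norm(v) * math.sqrt(n) * om)
```

For this subspace the computed value is $\sqrt{d/n}$, and the indicator of a
single block attains the maximum in (31).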

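Lemma 10. Let $\{\phi_1, \ldots, \phi_d\}$ be orthonormal functions on
$[0, 1]$. Define $\tilde{\mathcal{H}}_j$ to be the linear span of
$\{\phi_1, \ldots, \phi_j\}$. Let $x_i = i/n$, $i = 1, \ldots, n$ and
$\mathcal{F}_j = \{f = (h(x_1), \ldots, h(x_n)) : h \in \tilde{\mathcal{H}}_j\}$.
Then,

(34) $\Omega_{\mathcal{F}_j} = \max_{1 \le i \le n} \sqrt{\frac{\sum_{k=1}^j \phi_k^2(x_i)}{n}} + O(1/n)$.

In particular, if $\max_j \sup_x |\phi_j(x)| \le c$ then

(35) $\Omega_{\mathcal{F}_j} \le c\sqrt{\frac{j}{n}} + O(1/n)$.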

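In addition, we need the following lemma, first proved, in a related form,
in Baraud (2003).

Lemma 11. Let $\mathcal{F}$ be a subspace of dimension $d$. Let
$0 < \delta < 1 - \xi$ and

(36) $\epsilon = \frac{(n - d)^{1/4}}{\sqrt{n}} \left(2\log(1 + 4\delta^2)\right)^{1/4}$.

Define $A = \{f : \|f - \Pi_{\mathcal{F}} f\| > \epsilon\}$. Then,

(37) $\beta \equiv \inf_{\phi_\xi \in \Phi_\xi} \sup_{f \in A} P_f\{\phi_\xi = 0\} \ge 1 - \xi - \delta$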




where

(38) $\Phi_\xi = \left\{ \phi_\xi : \sup_{f \in \mathcal{F}} P_f\{\phi_\xi = 1\} \le \xi \right\}$

is the set of level $\xi$ tests.

3.2. Single Subspace. To begin, we start with a single subspace $\mathcal{F}$
of dimension $d$.

Definition 12. For given $\epsilon_2, \epsilon_\infty > 0$, define the
surrogate $f^*$ of $f$ by

(39) $f^* = \Pi f$ if $\|f - \Pi f\|_2 \le \epsilon_2$ and $\|f - \Pi f\|_\infty > \epsilon_\infty$, and $f^* = f$ otherwise.

Define the surrogate set of $f$, $F^*(f) = \{f, f^*\}$, which will be a
singleton when $f^* = f$. Define the spoiler set
$\mathcal{S}(\epsilon_2, \epsilon_\infty) = \{f \in \mathbb{R}^n : f^* \ne f\}$
and the invariant set
$\mathcal{I}(\epsilon_2, \epsilon_\infty) = \{f : f^* = f\}$.
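
The definition translates directly into a few lines of code. The sketch below
is a minimal illustration, assuming $\mathcal{F}$ is the subspace of vectors
constant on $d$ equal consecutive blocks, that $\|\cdot\|_2$ is the
root-mean-square norm, and that the first inequality is non-strict; none of
these choices is fixed by the text above, and the function names are ours.

```python
import math


def block_projection(f, d):
    """Projection onto vectors constant on d equal consecutive blocks."""
    size = len(f) // d
    out = []
    for b in range(d):
        block = f[b * size:(b + 1) * size]
        out.extend([sum(block) / size] * size)
    return out


def surrogate(f, d, eps2, eps_inf):
    """Return f* = Pi f if f is a spoiler (close in L2, far in sup norm), else f."""
    pf = block_projection(f, d)
    resid = [a - b for a, b in zip(f, pf)]
    l2 = math.sqrt(sum(r * r for r in resid) / len(f))  # assumed normalization
    linf = max(abs(r) for r in resid)
    if l2 <= eps2 and linf > eps_inf:
        return pf          # f lies in the spoiler set, so it is surrogated
    return list(f)         # f lies in the invariant set


if __name__ == "__main__":
    spike = [0.0] * 100
    spike[10] = 5.0        # a single isolated spike on an otherwise flat vector
    print(surrogate(spike, d=4, eps2=0.5, eps_inf=1.0)[:12])
```

The isolated spike is close to the subspace in the averaged norm but far in
the sup norm, so it is replaced by its projection; a vector already in the
subspace, or one that is far in both norms, is returned unchanged.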

We give a schematic diagram in Figure 3. The gray area represents
$\mathcal{S}(\epsilon_2, \epsilon_\infty)$. These are the functions that
preclude adaptivity. Being close to $\mathcal{F}$ in $L^2$ makes them hard to
detect but being far from $\mathcal{F}$ in $L^\infty$ makes them hard
to cover. To achieve adaptivity we must settle for sometimes covering
$\Pi_{\mathcal{F}} f$.

3.2.1. Lower Bounds. We begin with two lemmas. The first controls the
minimum width of a band and the second controls the maximum. The second
is of more interest for our purposes; the first lemma is included for
completeness. For any $1 \le p \le \infty$, $\epsilon > 0$, and
$A \subset \mathbb{R}^n$ define

(40) $M_p(\epsilon, A) = \sup\{d_{TV}(P_f, P_g) : f, g \in A,\ \|f - g\|_p \le \epsilon\}$

and

(41) $m_\infty(\epsilon, A_0, A_1) = \inf\{d_{TV}(P_f, P_g) : f \in A_0,\ g \in A_1,\ \|f - g\|_\infty \ge \epsilon\}$.

Lemma 13. Suppose that $\inf_{f \in A} P_f\{L \le f \le U\} \ge 1 - \alpha$.
Let $1 \le p \le \infty$ and $\epsilon > 0$. For $f \in A$, define

$\epsilon(f, q) = \sup\{\|f - h\|_q : h \in A,\ \|f - h\|_p \le \epsilon\}$,

where $1 \le q \le \infty$. Then, for any $A_0 \subset A$,

(42) $\inf_{f \in A_0} P_f\{W > \epsilon(f, \infty)\} \ge 1 - 2\alpha - \sup_{f \in A_0} M_p(\epsilon(f, p), A)$

where $W = \|U - L\|_\infty$. If every point in $A$ is contained in a subset of
$A$ of $\ell^p$-diameter $\epsilon$, then $\epsilon(f, p) = \epsilon$, and

(43) $\inf_{f \in A} P_f\{W > \epsilon\} \ge 1 - 2\alpha - M_p(\epsilon, A)$.

Fig 3. The dot at the center represents the subspace $\mathcal{F}$. The shaded
area is the set of spoilers $\mathcal{S}(\epsilon_2, \epsilon_\infty)$ of
vectors for which $f^* \ne f$. If these vectors were not surrogated,
adaptation is not possible. The non-shaded area is the invariant set
$\mathcal{I}(\epsilon_2, \epsilon_\infty) = \{f : f^* = f\}$.

Lemma 14. Suppose that $\inf_{f \in A} P_f\{L \le f \le U\} \ge 1 - \alpha$.
Suppose that $A = A_0 \cup A_1$ (not necessarily disjoint). Let
$\epsilon > 0$ be such that for each $f \in A_0$ there exists $g \in A_1$ for
which $\|f - g\|_\infty = \epsilon$. Then,

(44) $\sup_{f \in A} P_f\{W > \epsilon\} \ge 1 - 2\alpha - m_\infty(\epsilon, A_0, A_1)$

where $W = \|U - L\|_\infty$.

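Now we establish the target rate, the smallest width of a band if we knew
a priori that $f \in \mathcal{F}$. Define

(45) $w_{\mathcal{F}} \equiv w_{\mathcal{F}}(\alpha, \gamma, \sigma) = \Omega_{\mathcal{F}}\,\sigma\,\tau^{-1}(1 - 2\alpha - \gamma)$.

Theorem 15. Suppose that

(46) $\inf_{f \in \mathcal{F}} P_f\{L \le f \le U\} \ge 1 - \alpha$.

If $\inf_{f \in \mathcal{F}} P_f\{W \le w\} \ge 1 - \gamma$ then $w \ge w_{\mathcal{F}}$.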

A band that achieves this width, up to logarithmic factors, is
$(L, U) = \hat f \pm c$ where $\hat f = \Pi Y$ and
$c = \sigma\,(\Pi\Pi^T)_{ii}\,z_{\alpha/2n}$.



Remark 16. Using an argument similar to that in Theorem [71 it is
possible to improve this lower bound by an additional $\sqrt{\log d}$ factor,
but this is inconsequential to the rest of the paper.

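Next, we give the main result for this case. Define

(47) $v_0(\epsilon_2, \epsilon_\infty, n, \alpha, \gamma, \sigma) = \min\left\{\sqrt{n}\,\epsilon_2,\ \epsilon_\infty,\ \sigma\tau^{-1}(1 - 2\alpha - \gamma)\right\}$,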

fA \ 1 a \ / if £ 2 > 2t> 2 (n,d, a, 7) 

(48) vi(e 2 ,n,d,a,j,a) = < , , ; ' '( 

[ V2{n,d,a,j) if eg < 2v 2 (n,d, a, 7), 

(49) $v_2(n, d, \alpha, \gamma) = \kappa(\alpha, \gamma)\,(n - d)^{1/4}\,n^{-1/2}$

and define

(50) $v(\epsilon_2, \epsilon_\infty, n, d, \alpha, \gamma, \sigma) = \max(v_0, v_1)$.
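Theorem 17 (Lower Bound for Surrogate Confidence Band Width).
Fix $0 < \alpha < 1$ and $0 < \gamma < 1 - 2\alpha$. Suppose that for bands
$B = (L, U)$

(51) $\inf_{f \in \mathbb{R}^n} P_f\{F^*(f) \cap B \ne \emptyset\} \ge 1 - \alpha$.

Then,

(52) $\inf_{f \in \mathcal{F}} P_f\{W \le w\} \ge 1 - \gamma$

implies

(53) $w \ge w(\mathcal{F}, \epsilon_2, \epsilon_\infty, n, d, \alpha, \gamma, \sigma) \equiv \max\left\{w_{\mathcal{F}}(\alpha, \gamma, \sigma),\ v(\epsilon_2, \epsilon_\infty, n, d, \alpha, \gamma, \sigma)\right\}$.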



The inequality (51) ensures that $B$ is a valid surrogate confidence band:
for every function, either the function or its surrogate is covered with at
least the target probability. The result gives a probabilistic lower bound on
the width of the band that is at least as big as the best a priori width for
the subspace. As we will see, with proper choice of $\epsilon_2$ and
$\epsilon_\infty$, the $v$ term can be made small, giving the subspace width
for the lower bound.

Next, we address the question of optimality. Consider, for example, the
trivial surrogate that maps all functions to 0. We can cover the surrogate
using zero width bands with probability 1, but this would not be too
interesting. There is a tradeoff between the width of the bands on low
dimensional subspaces and the volume of the spoiler set, the functions that
are surrogated. We characterize optimality here as minimizing the volume of
the spoiler set $\mathcal{S}(\epsilon_2, \epsilon_\infty)$ while still
attaining the target width with high probability when $f$ truly lies in the
subspace. In this sense, the surrogate defined above is optimal.

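Theorem 18 (Optimality). Let $w$ denote the right hand side of inequality
(53). Then $w \ge w_{\mathcal{F}}$, where $w_{\mathcal{F}}$ is defined in
(45). Setting

$\epsilon_2 = 2\kappa(\alpha, \gamma)(n - d)^{1/4} n^{-1/2}$, $\epsilon_\infty = w_{\mathcal{F}}$

minimizes $\mathrm{Volume}(\mathcal{S}(\epsilon_2, \epsilon_\infty))$ subject
to achieving the lower bound on $w$.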

3.2.2. Achievability. Having established a lower bound, we need to show
that the lower bound is sharp. We do this by constructing a finite-sample
procedure that achieves the bound within a factor of 2. Let $F_{a,d}$ denote
the CDF of a $\chi^2$ random variable with $d$ degrees of freedom and
noncentrality parameter $a$ and let $\chi^2_{\alpha,d} = F^{-1}_{0,d}(1 - \alpha)$.
Let $T = \|Y - \Pi Y\|^2$ and define

(54) $B = (L, U) = \hat f \pm c\,\sigma$
where

(55) $\hat f = Y$ if $T > \chi^2_{\gamma, n-d}$, and $\hat f = \Pi Y$ if $T \le \chi^2_{\gamma, n-d}$,

and

(56) c= V2 „x{^ + E ~ l T T -A-* 
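
The center-selection step can be sketched as follows. This is a hedged
illustration, not the authors' implementation: it assumes $\mathcal{F}$ is the
subspace of vectors constant on $d$ equal consecutive blocks, scales the
statistic as the residual sum of squares divided by $\sigma^2$ so that it can
be compared with a $\chi^2_{n-d}$ quantile, and approximates that quantile
with the Wilson-Hilferty formula to stay within the standard library. The
half-width constant $c$ is not computed here. All function names are ours.

```python
import math
from statistics import NormalDist


def chi2_upper_quantile(gamma, dof):
    """Approximate upper-gamma quantile of a central chi-square.

    Wilson-Hilferty normal approximation; an assumption of this sketch.
    """
    z = NormalDist().inv_cdf(1.0 - gamma)
    h = 2.0 / (9.0 * dof)
    return dof * (1.0 - h + z * math.sqrt(h)) ** 3


def block_means(y, d):
    """Project y onto vectors constant on d equal consecutive blocks."""
    if len(y) % d != 0:
        raise ValueError("this sketch assumes len(y) is divisible by d")
    size = len(y) // d
    fit = []
    for b in range(d):
        block = y[b * size:(b + 1) * size]
        fit.extend([sum(block) / size] * size)
    return fit


def pretest_center(y, d, gamma, sigma=1.0):
    """Goodness-of-fit pretest: keep the subspace fit unless the test rejects."""
    n = len(y)
    fit = block_means(y, d)
    # Residual sum of squares over sigma^2 (assumed scaling of the statistic T).
    t_stat = sum((a - b) ** 2 for a, b in zip(y, fit)) / sigma ** 2
    reject = t_stat > chi2_upper_quantile(gamma, n - d)
    center = list(y) if reject else fit
    return center, reject, t_stat


if __name__ == "__main__":
    y = [0.0] * 40
    y[3] = 30.0  # a large spike that the block model cannot fit
    print(pretest_center(y, d=4, gamma=0.05)[1:])
```

When the test rejects, the band falls back to the raw observations as its
center; otherwise it is centered on the projection $\Pi Y$.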

Theorem 19. If

( 57 ) 7 > 1 " F , n _ d (F-| n _>/2)) 
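then

(58) $\inf_{f \in \mathbb{R}^n} P_f\{F^*(f) \cap B \ne \emptyset\} \ge 1 - \alpha$

and

(59) $\inf_{f \in \mathcal{F}} P_f\{W \le w_{\mathcal{F}} + \epsilon_\infty\} \ge 1 - \gamma$.

If $\epsilon_2 \ge E(n - d, \alpha/2, \gamma)(n - d)^{1/4} n^{-1/2}$, where
$E(m, \alpha, \gamma)$ is defined in (28), then

(60) $\inf_{f \in \mathcal{F}} P_f\{W \le 2w(\mathcal{F}, \epsilon_2, \epsilon_\infty, \alpha, \gamma, n, d)\} \ge 1 - \gamma$,

where $w(\mathcal{F}, \epsilon_2, \epsilon_\infty, \alpha, \gamma, n, d)$ is
defined in (53). Hence, the procedure adapts to within a logarithmic factor of
the lower bound $w$ given in Theorem 17.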

Corollary 20. Setting

$\epsilon_2 = E(n - d, \alpha/2, \gamma)(n - d)^{1/4} n^{-1/2}$, $\epsilon_\infty = w_{\mathcal{F}}$

in the above procedure minimizes
$\mathrm{Volume}(\mathcal{S}(\epsilon_2, \epsilon_\infty))$ subject to satisfying

Remark 21. The results can be extended to unknown $\sigma$ by replacing
$\sigma$ with a nonparametric estimate $\hat\sigma$. However, the results are
then asymptotic rather than finite sample. Moreover, a minimal amount of
smoothness is required to ensure that $\hat\sigma$ consistently estimates
$\sigma$; see Genovese and Wasserman (2005). So as not to detract from our
main points, we continue to take $\sigma$ known.
3.2.3. Remarks on Estimation and the Modulus of Continuity. It is
interesting to note that the bands defined above cover the true $f$ over a set
$\mathcal{V}$ that is larger than $\mathcal{F}$. In this section we take a
brief look at the properties of $\mathcal{V}$.

Define 

(61) C(a, a, b) = sup (au + b) ( 1 - a - ~ + ~$(-u/2) \ , 

u>o \ 4 2 / 

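and let $C(\alpha) = C(\alpha, 1, 0)$. Let $\mathcal{F}^\perp$ be the
orthogonal complement of $\mathcal{F}$. Let $B_k^\perp(0, \epsilon)$ be an
$\ell^k$-ball around 0 in $\mathcal{F}^\perp$ ($k = 2, \infty$). For
$f \in \mathbb{R}^n$, let $B_k^\perp(f, \epsilon) = f + B_k^\perp(0, \epsilon)$.
Define

(62) $\mathcal{V} = \mathcal{V}(\mathcal{F}, \epsilon_2, \epsilon_\infty) = \bigcup_{f \in \mathcal{F}} \left( B_2^\perp(f, \epsilon_2) \cap B_\infty^\perp(f, \epsilon_\infty) \right)$.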

Lemma 22. Let $B = (L, U)$ be defined as in (54). Then

(63) $\inf_{f \in \mathcal{V}} P_f\{L \le f \le U\} \ge 1 - \alpha$.

Let T/ = fi. The next lemma gives the modulus of continuity (Donoho 
and Liu 1991) of T over V which measures the difficulty of estimation over 
V . The modulus of continuity of T over a set A is 

(64) $\omega(u, A) = \sup\{|Tf - Tg| : \|f - g\|_2 \le u;\ f, g \in A\}$.

Donoho and Liu showed that the difficulty of estimation over $A$ is often
characterized by $\omega(1/\sqrt{n}, A)$ in the sense that this quantity
defines a lower bound on estimation rates.

Lemma 23 (Modulus of Continuity). We have 



(65) u(u, V) = [ nOy^y J~Pq2 + min ( TfqTfp ' 62 A ( e °°/^ 



Note that when e 2 = £oo = and ~ \/d/n, we have u(\/yfn, A) ~ yd/n 
as expected. However, when e = e 2 = tool^fn is large we will have that 
uj(l/y/n,A) ~ y/d/n + e/yl~+ <i 2 /n. The extra term e/yT+ c? 2 /n reflects 
the "ball-like" behavior of V in addition to the subspace-like behavior of V. 
The bands need to cover over this extra set to maintain valid coverage and 
this leads to larger lower bounds than just covering over T . 
3.3. Nested Subspaces. Now suppose that we have nested subspaces
$\mathcal{F}_1 \subset \cdots \subset \mathcal{F}_m \subset \mathcal{F}_{m+1} = \mathbb{R}^n$.
Let $\Pi_j$ denote the projector onto $\mathcal{F}_j$. We define the surrogate
as follows.

Definition 24. For given $\epsilon_2 = (\epsilon_{2,1}, \ldots, \epsilon_{2,m})$
and $\epsilon_\infty = (\epsilon_{\infty,1}, \ldots, \epsilon_{\infty,m})$
define

(66) $\mathcal{J}(f) = \left\{1 \le j \le m : \|f - \Pi_j f\|_2 \le \epsilon_{2,j} \text{ and } \|f - \Pi_j f\|_\infty > \epsilon_{\infty,j}\right\}$.

Then define the surrogate set

(67) $F^*(f) = \{\Pi_j f : j \in \mathcal{J}(f)\} \cup \{f\}$.

Definition 25. We say that $B = \{g : L \le g \le U\} \equiv (L, U)$ has
coverage $1 - \alpha$ if

(68) $\inf_{f \in \mathbb{R}^n} P_f\{F^* \cap B \ne \emptyset\} \ge 1 - \alpha$.

3.3.1. Lower Bounds.

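Theorem 26 (Lower Bound for Surrogate Confidence Band Width).
Fix $0 < \alpha < 1$ and $0 < \gamma < 1 - 2\alpha$. Suppose that for bands
$B = (L, U)$

(69) $\inf_{f \in \mathbb{R}^n} P_f\{F^*(f) \cap B \ne \emptyset\} \ge 1 - \alpha$.

Then

(70) $\inf_{f \in \mathcal{F}_j} P_f\{W \le w\} \ge 1 - \gamma$

implies

(71) $w \ge w(\mathcal{F}_j, \epsilon_{2,j}, \epsilon_{\infty,j}, n, d_j, \alpha, \gamma, \sigma)$,

where $w$ is given in Theorem 17.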

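Theorem 27 (Optimality). Let $w$ denote the right hand side of inequality
(71). Then $w \ge w_{\mathcal{F}_j}$, where $w_{\mathcal{F}}$ is defined in
(45). Setting

$\epsilon_{2,j} = 2\kappa(\alpha, \gamma)(n - d_j)^{1/4} n^{-1/2}$, $\epsilon_{\infty,j} = w_{\mathcal{F}_j}$

minimizes the volume of the set

(72) $\left\{f : \|f - \Pi_j f\| \le \epsilon_{2,j} \text{ and } \|f - \Pi_j f\|_\infty > \epsilon_{\infty,j}\right\}$

subject to achieving the lower bound on $w$.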
3.3.2. Achievability. Define $T_j = \|Y - \Pi_j Y\|^2$ and
$\hat f = \Pi_{\hat J} Y$, where

(73) $\hat J = \min\{1 \le j \le m : T_j \le \chi^2_{\gamma, n - d_j}\}$,

where $\hat J = m + 1$ if the set is empty, and define



(74) 



z ctj /2n 



( a j) + e oo,j if 1 < J < m 
1 if j = m + 1. 



Finally, let $B = (L, U) = \hat f \pm c_{\hat J}\,\sigma$ where
$\sum_j \alpha_j \le \alpha$.

Theorem 28. If

(75) 7 > 1 - 

(76) inf P f {F*nB/l} > I - a. 



Let $w_j = w_{\mathcal{F}_j}(\alpha_j) + \epsilon_{\infty,j}$. If $w_1 \le \cdots \le w_{m+1}$ then

(77) $\inf_{f \in \mathcal{F}_j} P_f\{W \le w_j\} \ge 1 - \gamma.$

If in addition €2j > E{n — dj,aj,-y)(n — dj) 1//4 n _1//2 and eooj < wp. then 

(78) inf ¥ f {W < 2w(e 2 j,e ooJ ,a j ,>y,n,d j )} > 1 - 7 

where w(e2j, Cooji 7> ^ is defined {53\). Hence, the procedure adapts 
to within a logarithmic factor of the lower bound w given in Theorem 

Corollary 29. Suppose $\alpha_1 = \cdots = \alpha_{m+1} = \alpha/(m + 1)$. Then $w_1 \le \cdots \le w_{m+1}$ so (77) holds. Moreover, setting



(79) e 2j - = E(n-d j ,a j ,-f)(n-d j ) 1/4 n- 1/2 
and 

(80) eooj = w T . 

in the above procedure, minimizes the volume of the set satisfying [71 



Example 30. Suppose that $x_i = i/n$ and let $B_1 = [0, 1/d]$, $B_2 = (1/d, 2/d]$,
$\ldots$, $B_d = ((d-1)/d, 1]$. Write $f = (f(x_i) : i = 1, \ldots, n)$ and let $\mathcal{F}$ denote
the subspace of vectors $f$ that are constant over each $B_j$. Then $\Omega_{\mathcal{F}} = \sqrt{d/n}$.
The above procedure then produces a band with width no more than $O(\sqrt{d/n})$
with probability at least $1 - \gamma$.
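
The value $\Omega_{\mathcal{F}} = \sqrt{d/n}$ in Example 30 can be checked numerically. The sketch below assumes that $\Omega_{\mathcal{F}} = \max_i \|\Pi e_i\|/\|e_i\|$, which is how the quantity appears in the proofs of Section 4, and that $n$ is a multiple of $d$ so that every bin holds $n/d$ design points. The function name is hypothetical and is not part of the paper.

```python
import math

def omega_piecewise_constant(n, d):
    """Return max_i ||Pi e_i|| / ||e_i|| for vectors constant on d bins of x_i = i/n.

    Assumption (not from the paper's text): Omega_F is the largest ratio
    ||Pi e_i|| / ||e_i|| over the coordinate vectors e_i.
    """
    # Bin index of x_i = i/n under B_1 = [0, 1/d], B_k = ((k-1)/d, k/d]:
    # ceil(i*d/n), computed in integers to avoid floating-point edge cases.
    bins = [(i * d + n - 1) // n for i in range(1, n + 1)]
    counts = {}
    for b in bins:
        counts[b] = counts.get(b, 0) + 1
    # Projecting e_i onto the bin indicators averages within the bin, so
    # Pi e_i equals 1/m on the m points of the bin containing x_i and 0
    # elsewhere. Hence ||Pi e_i||_2 = 1/sqrt(m) while ||e_i||_2 = 1.
    return max(1.0 / math.sqrt(counts[b]) for b in bins)

omega = omega_piecewise_constant(n=1000, d=10)
print(omega, math.sqrt(10 / 1000))
```

The ratio is scale invariant, so the same value results whether the Euclidean norm or the normalized two-norm is used.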
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4. Proofs. In this section, we prove the main results. We omit proofs for 
a few of the simpler lemmas. Throughout this section, we write $x_n = O^*(b_n)$
to mean that $x_n = O(c_n b_n)$ where $c_n$ increases at most logarithmically with
$n$.

The following lemma is essentially from Section 3.3 of Ingster and Suslina 
(2003). 

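Lemma 31. Let $M$ be a probability measure on $\mathbb{R}^n$ and let

$Q(\cdot) = \int P_f(\cdot)\, dM(f)$

where $P_f(\cdot)$ denotes the measure for a multivariate Normal with mean $f = (f_1, \ldots, f_n)$ and covariance $\sigma^2 I$. Then

(81) $L_1(Q, P_g) \le \sqrt{\int\!\!\int \exp\left\{\frac{n\langle f - g, \nu - g\rangle}{\sigma^2}\right\} dM(f)\, dM(\nu) - 1}.$
In particular, if $Q$ is uniform on a finite set $\Omega$ with $N = |\Omega|$, then

(82) $L_1(Q, P_g) \le \sqrt{\left(\frac{1}{N}\right)^2 \sum_{f, \nu \in \Omega} \exp\left\{\frac{n\langle f - g, \nu - g\rangle}{\sigma^2}\right\} - 1}.$
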

Proof. Let $p_f$ denote the density of a multivariate Normal with mean $f$
and covariance $\sigma^2 I$ where $I$ is the identity matrix. Let $q$ be the density of
$Q$:

$q(y) = \int p_f(y)\, dM(f).$

Then,

(83) $\int |p_g(x) - q(x)|\, dx = \int \frac{|p_g(x) - q(x)|}{\sqrt{p_g(x)}} \sqrt{p_g(x)}\, dx \le \sqrt{\int \frac{(p_g(x) - q(x))^2}{p_g(x)}\, dx} = \sqrt{\int \frac{q^2(x)}{p_g(x)}\, dx - 1}.$






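Now,

$\int \frac{q^2(x)}{p_g(x)}\, dx = \int \left(\frac{q(x)}{p_g(x)}\right)^2 p_g(x)\, dx = E_g\left(\frac{q(x)}{p_g(x)}\right)^2$

$= \int\!\!\int E_g\left(\frac{p_f(x)\, p_\nu(x)}{p_g^2(x)}\right) dM(f)\, dM(\nu)$

$= \int\!\!\int \exp\left\{-\frac{n}{2\sigma^2}\left(\|f - g\|^2 + \|\nu - g\|^2\right)\right\} E_g\left(\exp\left\{\epsilon^T(f + \nu - 2g)/\sigma^2\right\}\right) dM(f)\, dM(\nu)$

$= \int\!\!\int \exp\left\{-\frac{n}{2\sigma^2}\left(\|f - g\|^2 + \|\nu - g\|^2\right)\right\} \exp\left\{\sum_{i=1}^n (f_i - g_i + \nu_i - g_i)^2/(2\sigma^2)\right\} dM(f)\, dM(\nu)$

$= \int\!\!\int \exp\left\{\frac{n\langle f - g, \nu - g\rangle}{\sigma^2}\right\} dM(f)\, dM(\nu)$

and the result follows from (83).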



□ 



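Proof of Theorem 1. Let $N = |\Omega|$ and let $b^2 = n \max_{f \in \Omega} \|f - g\|^2$.
Let $p_f$ denote the density of a multivariate Normal with mean $f$ and covariance
$\sigma^2 I$ where $I$ is the identity matrix. Define the mixture

$q = \frac{1}{N} \sum_{f \in \Omega} p_f.$

By Lemma 31,

$\int |p_g(x) - q(x)|\, dx \le \sqrt{\left(\frac{1}{N}\right)^2 \sum_{f, \nu \in \Omega} \exp\left\{\frac{n\langle f - g, \nu - g\rangle}{\sigma^2}\right\} - 1} \le \sqrt{\frac{1}{N^2}\left(N e^{b^2/\sigma^2} + N(N - 1)\right) - 1} \le \epsilon.$

Define two events, $A = \{\ell \le g \le u\}$ and $B = \{\ell \le f \le u \text{ for some } f \in \Omega\}$.
Then, $A \cap B \subset \{w_n \ge a\}$ where

$a = \min_{f \in \Omega} \|g - f\|_\infty.$

Since $P_f\{\ell \le f \le u\} \ge 1 - \alpha$ for all $f$, it follows that $P_f\{B\} \ge 1 - \alpha$ for all
$f \in \Omega$. Hence, $Q(B) \ge 1 - \alpha$. So,

$P_g\{w_n \ge a\} \ge P_g\{A \cap B\} \ge Q(A \cap B) - \epsilon = Q(A) + Q(B) - Q(A \cup B) - \epsilon$

$\ge Q(A) + Q(B) - 1 - \epsilon \ge Q(A) + (1 - \alpha) - 1 - \epsilon \ge P_g\{A\} + (1 - \alpha) - 1 - 2\epsilon$

$\ge (1 - \alpha) + (1 - \alpha) - 1 - 2\epsilon = 1 - 2\alpha - 2\epsilon.$

So, $E_g(w_n) \ge (1 - 2\alpha - 2\epsilon)\, a$. □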
Proof of Theorem 2. Let $g \in \mathbb{R}^n$ be arbitrary, let

$a_n = \sigma\sqrt{\log(n\epsilon^2)}$

and define

$\Omega = \{g + (a_n, 0, \ldots, 0),\ g + (0, a_n, \ldots, 0),\ \ldots,\ g + (0, 0, \ldots, a_n)\}.$

Then the conditions of Theorem 1 are satisfied with $N = n$, and hence

(84) $E_g(W) \ge (1 - 2\alpha - 2\epsilon) \min_{f \in \Omega} \|g - f\|_\infty = (1 - 2\alpha - 2\epsilon)\, a_n.$

This is true for each $g$ and hence (18) follows. The last statement of the
theorem follows from standard Gaussian tail inequalities. □
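
The arithmetic behind this choice of $a_n$ can be illustrated numerically. The sketch below assumes the reading of the proof of Theorem 1 in which the $N$ perturbations are mutually orthogonal, so that the $L_1$ bound simplifies to $\sqrt{(e^{b^2/\sigma^2} - 1)/N}$ with $b^2 = a_n^2$. The function name and the parameter values are illustrative only.

```python
import math

def theorem2_quantities(n, sigma, alpha, eps):
    """Return (a_n, L1 bound, lower bound on E_g(W)) for the Theorem 2 construction.

    Assumes n * eps**2 > 1 so that the logarithm is positive.
    """
    # Perturbation size from the proof: a_n = sigma * sqrt(log(n * eps^2)).
    a_n = sigma * math.sqrt(math.log(n * eps ** 2))
    # With N = n orthogonal perturbations, b^2 = a_n^2, and the bound
    # sqrt((1/N^2) * (N * exp(b^2/sigma^2) + N*(N-1)) - 1) simplifies to
    # sqrt((exp(b^2/sigma^2) - 1) / N).
    l1_bound = math.sqrt((math.exp(a_n ** 2 / sigma ** 2) - 1.0) / n)
    # Lower bound (84) on the expected width.
    lower = (1.0 - 2.0 * alpha - 2.0 * eps) * a_n
    return a_n, l1_bound, lower

a_n, l1_bound, lower = theorem2_quantities(n=10_000, sigma=1.0, alpha=0.05, eps=0.1)
print(a_n, l1_bound, lower)
```

With these values $e^{a_n^2/\sigma^2} = n\epsilon^2$, so the $L_1$ bound equals $\sqrt{\epsilon^2 - 1/n}$, which is just below $\epsilon$ as the proof requires.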

Proof of Theorem 3. We construct the appropriate set $\Omega$, and apply
Theorem 1. For simplicity, we build $\Omega$ around $g = (0, \ldots, 0)$, the extension
to arbitrary $g$ being straightforward. Set $a = a_n$ from the statement of the
theorem, and define

$F(x) = \begin{cases} Lx & 0 \le x \le a/L \\ 2a - Lx & a/L \le x \le 2a/L. \end{cases}$

Note that $F \in \mathcal{F}(L)$ and that $F$ minimizes $\|F\|_2$ among all $F \in \mathcal{F}(L)$
with $\|F\|_\infty = a$. For simplicity, assume that $2aN/L = 1$ for some integer
$N$. Define $F_1(\cdot) = F(\cdot)$, $F_2(\cdot) = F(\cdot - \delta)$, $\ldots$, and $F_N(\cdot) = F(\cdot - N\delta)$. Let
$\Omega = \{f_1, \ldots, f_N\}$ where $f_j = (F_j(x_1), \ldots, F_j(x_n))$. Now

.. „ 2na 3 

and so 



31 



N 

Now apply Theorem [TJ 

To prove the last statement, we note that it is well known that if F is a 
kernel estimator with triangular kernel and bandwidth h = 0(n -1 / 3 ) then 



savE F (\\F-F\\ 00 )<C[-2-\ = Q 



fee 



logn\ 



1/3 



I? 



'n 



for some C > 0. Then B = (F — ^ , F + ^) (restricted to Xi = i/n) is valid 
by Markov's inequality and has the rate a n . □ 
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Proof outline of Theorem [H We will use the fact that an appropri- 
ately chosen wavelet basis forms a basis for T '. Let 



Jn ~ lot 



n 



l/(2p+l) ' 



2 



log n 



b n = Jlog(2^e 2 ) 



and 

F(s) = & n 2 J " /2 V>(2 Jn x) 

where ^ is a compactly supported mother wavelet. Then = b n 2 Jn ^ 2 2 pJn ^^ (2" 
so that f(F&)) 2 < c 2 for all large n so that F E J- '. 
Let / = (F(x i ),...,i ? (x„)). Then, 

||/||oo = 6n2 J " /2 = 0*(n-^ +1 )) 

and y^-ll/lb ~ \fnb n . Let = {F(x\ — kA), . . . ,F(x n — kA)) T where A is 
just large enough so that the F^'s are orthogonal. Hence, A ~ 1/N where 
N ~ 2 J ". Finally, set U = {f x , . . . , f N }. Then, 



e n\\f\\V^ 



N 



e nbi/a 2 2 J n < e 2 



for each / £ f2. The lower bound follows from Theorem [TJ 
A fixed-width procedure that achieves the bound is 

@i = fi Cn z a/m u i = fi C-n z a/n- 

where /, = F(x{), 

J 

j j=l k 



«i = n 1 T,i Y i ( l ) j(xi), Pjk = n 1 Y,i Y i^jk{x i ) and c n = y maxj, Var(F(x)). 

□ 

PROOF outline of Theorem [5j Again, we use the fact that an appro- 
priately chosen wavelet basis forms a basis for J-. Let 



log 



J, 



J r 2 



? ' 2 p 
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Let 

a n = —j=J\og2 J e 2 
Jn v 



and define F(x) = a n 2 J ' 2 ij}(x), where is a compactly supported mother 
wavelet. Then, ||/|| = a n , ||/||oo = o, n 2 J / 2 , and < c — 5 for all large n. 

Take Vl around g to be non-overlapping translations of F added to g. Then 
N ~ 2 J and conditions of Theorem [1] hold. Moreover, 

The bound is achieved by Markov's inequality applied to the soft-thresholded wavelet
estimator with universal thresholding. □

Proof of Lemma [7J Q is the solution, with respect to c, to £ = 1 — 
Fo,m( r (c)) where the function r(c) = F~J^ m (0)) is monotonically increasing 
in c. Also, Fo iJn (r(0)) = (3 and Fo j7n (r(oo)) = 1 so a solution exists since 
< /3 < 1 — £ < 1. Now we bound Q from above. 

To upper bound Q it suffices to find c such that 

( 85 ) ^ > *5"m(l - 0- 

From Birge (2001) we have 



(86) < « + d + 2J(2« + d)log(l/(l-«))+21og(l/(l-u)) 



( 87 ) ^ s + d-2^/(2« + d)log(l/u). 

Hence, 



F c^K,m^ - rn + cy/m-2J(2cVm + m) log - 



F -^(l- 7 ) < m + 2Wmlog- + 21og-. 



It suffices to find c that satisfies 



(90) m + c\fm — 2\\ (2c\/m + m) log — > m + 2* \m\og — h 2 log — , 

V /? V ^ ^ 

or equivalently, 



(91) r>2j(-^ + l)logi+2f Jlogi + loi 
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The right hand side of the last inequality is largest when m = 1, and equality 
can be achieved when m = 1 at some A(/3, £) for any /?, £ satisfying the stated 
conditions. Equality can be achieved then for any m at some Q(m,(3,£) < 
A(/3, £). This proves the first claim. The second claim follows immediately 
by inspection. □ 

Proof of Lemma El Note that 



(92)min< \\v\\ : IMloo = 1 



(93) 



v 



mm Tj rj — 

|M|oo 

1 

I 1 1 oo 

max„ e ^ jo— 



(94) 



max< || v || oo : v S J 7 , \\v\\ = 1 



If v solves one of these problems then ev solves the more general version in 
the statement of the lemma. It now suffices to show just the second equality. 
Now, £l?r = maxj fij where 



a 



|e;|| lln^ej 



|n^ej 



Maximizing $f_i = e_i^T f$ for $f \in \mathcal{F}$ and $\|f\| \le 1$ is equivalent to maximizing
$n\langle e_i, f\rangle = n\langle \Pi_{\mathcal{F}} e_i, f\rangle$. The maximum subject to the constraint occurs
at $f^* = \Pi e_i/\|\Pi e_i\|$. Hence, the maximum is $e_i^T f^* = (\Pi e_i)^T f^* =
n\|\Pi e_i\|^2/\|\Pi e_i\| = \|\Pi e_i\|^2/(\|\Pi e_i\|\, \|e_i\|^2) = \sqrt{n}\,\Omega_i$. Maximizing over $i$ completes
the proof. □



Proof of Lemma [TTJ We find a P £ Tj and a measure /i supported 
on A such that g? tv (-Po, Pp) < 25. We then have, following Ingster (1993), 



(95) 
(96) 
(97) 

(98) 
(99) 



> inf P^{^ = 0} 

> 1-^- sup \P (R) - P»(R)\ 

R: P (R)<€ 

> l-Z- S up\P (R)-P^R)\ 

R 



= 1 
> 1 



dry (Po j Pfj, 
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Let ipi, ip2i ■ ■ ■ j V'n be an orthonormal basis for M. n such that tpi , . . . , tpd 
form an orthonormal basis for $\mathcal{F}$. Fix $\tau > 0$ small and let $\lambda^2 = n\epsilon^2/(n - d) + \tau^2/(n - d)$. Define

(100) $f_E = \lambda \sum_{s=d+1}^{n} E_s \psi_s,$

where $(E_s : s = d+1, \ldots, n)$ are independent Rademacher random variables,
that is, $P\{E_s = 1\} = P\{E_s = -1\} = 1/2$. Now, $\Pi_{\mathcal{F}} f_E = 0$ and hence
$\|f_E - \Pi_{\mathcal{F}} f_E\|^2 = \lambda^2 (n - d)/n > \epsilon^2$, and hence $f_E \in A$ for each choice of the Rademachers.

Let P^ = E(Pe) where Pe is the distribution under Je and the expecta- 
tion is with repect to the Rademachers. Choose fo € J 7 and let Pq be the 
corresponding distribution. As in Baraud, we use the bound 



(LOU (P.-.P- s L ('^f : J 



dP 



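We take $f_0 = (0, \ldots, 0) \in \mathcal{F}$ and so

(102) $\frac{dP_\mu}{dP_0}(Y) = E_E\left(\exp\left\{-\frac{1}{2}\lambda^2(n - d) + \lambda \sum_{s=d+1}^{n} E_s\, Y \cdot \psi_s\right\}\right)$

(103) $= e^{-\lambda^2 (n-d)/2} \prod_{s=d+1}^{n} \cosh(\lambda\, Y \cdot \psi_s).$

Since $E_0 \cosh^2(\lambda\, Y \cdot \psi_s) = e^{\lambda^2} \cosh(\lambda^2)$ and $\cosh(x) \le e^{x^2/2}$ we have

(104) $E_0\left(\frac{dP_\mu}{dP_0}(Y)\right)^2 = \left(\cosh(\lambda^2)\right)^{n-d}$

(105) $\le e^{(n-d)\lambda^4/2}$

(106) $= \exp\left(\frac{n^2}{2(n-d)}\epsilon^4 + \frac{\tau^4}{2(n-d)} + \frac{n}{n-d}\tau^2\epsilon^2\right).$

By the definition of $\epsilon$ (in terms of $\delta$), $\beta \ge 1 - \xi - \delta + O(\tau)$, and because this
holds for every $\tau$, the result follows.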

□ 

Proof of Lemma 13. Let $f, g \in A$ be such that $\|f - g\|_p \le \epsilon$. Then,

$P_g\{L \le f \le U\}$

(107) $= P_f\{L \le f \le U\} + P_g\{L \le f \le U\} - P_f\{L \le f \le U\}$

(108) $\ge P_f\{L \le f \le U\} - d_{TV}(P_f, P_g)$

(109) $\ge 1 - \alpha - M_p(\|f - g\|_p, A)$

(110) $\ge 1 - \alpha - M_p(\epsilon(f, p), A).$

We also have that $P_g\{L \le g \le U\} \ge 1 - \alpha$. Hence,

$P_g\{L \le g \le U,\ L \le f \le U\}$

(111) $\ge P_g\{L \le g \le U\} + P_g\{L \le f \le U\} - 1$

(112) $\ge 1 - \alpha + 1 - \alpha - M_p(\epsilon(f, p), A) - 1$

(113) $\ge 1 - 2\alpha - M_p(\epsilon(f, p), A).$

The event {L < g < U, L < f < U} implies that W > \\g — f\\oo- Hence, 

^f{W>\\f-g\\oo} > l-2a-M p (e(f,p),A) 

> l-2a-M p (e(f,p),A) 

> 1 - 2a - M p (e,A). 

It follows then that 

(114) ¥ f {W>e(f,oo)} = MF f {W>\\f-g\\ 00 }. 
and thus 

(115) inf ¥ f {W >e(f,oo)}>l-2a- sup M p (e(f,p), A). 

This proves the first claim. But e(/, oo) > e(f,p) for any 1 < p < oo. The 
final claim follows immediately. □ 

Proof of Lemma 14. Choose $f \in A_0$. Choose $g \in A_1$ to minimize
$d_{TV}(P_f, P_g)$ subject to $\|f - g\|_\infty = \epsilon$. Hence, $d_{TV}(P_f, P_g) = m_\infty(\epsilon, A_0, A_1)$.
Then,

$P_f\{L \le g \le U\}$

(116) $= P_g\{L \le g \le U\} + P_f\{L \le g \le U\} - P_g\{L \le g \le U\}$

(117) $\ge P_g\{L \le g \le U\} - d_{TV}(P_f, P_g)$

(118) $\ge 1 - \alpha - m_\infty(\epsilon, A_0, A_1)$

because, by assumption, $P_g\{L \le g \le U\} \ge 1 - \alpha$. We also have that $P_f\{L \le f \le U\} \ge 1 - \alpha$. Hence,

$P_f\{L \le f \le U,\ L \le g \le U\}$

(119) $\ge P_f\{L \le f \le U\} + P_f\{L \le g \le U\} - 1$

(120) $\ge 1 - \alpha + 1 - \alpha - m_\infty(\epsilon, A_0, A_1) - 1$

(121) $\ge 1 - 2\alpha - m_\infty(\epsilon, A_0, A_1).$

The event $\{L \le f \le U,\ L \le g \le U\}$ implies that $W \ge \|f - g\|_\infty$. Hence,

(122) $P_f\{W \ge \|f - g\|_\infty\} \ge 1 - 2\alpha - m_\infty(\epsilon, A_0, A_1).$
It follows then that

(123) $\sup_{f \in A_0} P_f\{W \ge \epsilon\} \ge 1 - 2\alpha - m_\infty(\epsilon, A_0, A_1).$

□ 

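Proof of Theorem 15. First, we compute $m_\infty(\epsilon, \mathcal{F}, \mathcal{F})$. Note that for
all $f \in \mathcal{F}$, $d_{TV}(P_f, P_0) = \tau(\sqrt{n}\|f\|/\sigma)$. Hence, $m_\infty(\epsilon, \mathcal{F}, \mathcal{F}) = \tau(\sqrt{n}\,v/\sigma)$ where
$v = \min\{\|f\| : f \in \mathcal{F},\ \|f\|_\infty = \epsilon\}$. By the lemma on $\Omega_{\mathcal{F}}$ proved above, $v = \epsilon/(\sqrt{n}\,\Omega_{\mathcal{F}})$. It follows
by Lemma 14, taking $\epsilon = w$, that

(124) $\sup_{f \in \mathcal{F}} P_f\{W > w\} \ge 1 - 2\alpha - \tau\left(\frac{w}{\sigma\,\Omega_{\mathcal{F}}}\right).$

Let $w^* = \sigma\,\Omega_{\mathcal{F}}\,\tau^{-1}(1 - 2\alpha - \gamma)$. It follows that if $w < w^*$ then $\inf_{f \in \mathcal{F}} P_f\{W \le w\} < 1 - \gamma$, which is a contradiction.

That the proposed band has correct coverage follows easily. Now, $(\Pi\Pi^T)_{jj} \le \Omega_{\mathcal{F}}$
and $z_{\alpha/2n} \le \sqrt{c \log n}$ for some $c$ and the claim follows. □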
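
To give a feel for the size of $w^* = \sigma\,\Omega_{\mathcal{F}}\,\tau^{-1}(1 - 2\alpha - \gamma)$, the following sketch evaluates it numerically. The definition of $\tau$ is not restated in this section, so the code assumes the standard total variation formula for two Gaussians with common identity covariance, $\tau(a) = 2\Phi(a/2) - 1$, where $a$ is the distance between the means in units of $\sigma$. This is an assumption, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def tau(a):
    # ASSUMPTION: tau(a) = 2*Phi(a/2) - 1, the total variation distance
    # between N(0, I) and N(theta, I) when ||theta||_2 = a.
    return 2.0 * _STD_NORMAL.cdf(a / 2.0) - 1.0

def tau_inv(p):
    # Inverse of tau on (0, 1): solve 2*Phi(a/2) - 1 = p for a.
    return 2.0 * _STD_NORMAL.inv_cdf((1.0 + p) / 2.0)

def w_star(sigma, omega, alpha, gamma):
    # Width threshold w* = sigma * Omega_F * tau^{-1}(1 - 2*alpha - gamma);
    # requires 0 < gamma < 1 - 2*alpha.
    return sigma * omega * tau_inv(1.0 - 2.0 * alpha - gamma)

# Piecewise-constant subspace of Example 30 with n = 1000 and d = 10.
w = w_star(sigma=1.0, omega=math.sqrt(10 / 1000), alpha=0.05, gamma=0.1)
print(w)
```

Under this assumed form of $\tau$, the threshold scales linearly in both $\sigma$ and $\Omega_{\mathcal{F}}$, so for the subspace of Example 30 it is of order $\sqrt{d/n}$.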

Proof of Theorem 17. We break the argument up into three parts.
Parts I and II taken together contribute the term from equation (47) to the
bounds. The logic of both parts is the same: find a value $w^*$ such that if $w < w^*$
then $\sup_{f \in \mathcal{F}} P\{W > w\} > \gamma$ and, equivalently, $\inf_{f \in \mathcal{F}} P\{W \le w\} < 1 - \gamma$,
which gives a contradiction under the assumptions of the theorem.
Part III contributes the term $v_1$ from equation (48) to the bounds. It is
based on using the confidence bands to construct both an estimator and a
test. Throughout the proof, we refer to the space $V \supset \mathcal{F}$ defined in equation
(62); this is the set of spoilers that are within $\epsilon_2$ of $\mathcal{F}$.

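Part I. First, we compute $m_\infty(w, \mathcal{F}, \mathcal{F})$. Note that for all $f \in \mathcal{F}$, $d_{TV}(P_f, P_0) =
\tau(\sqrt{n}\|f\|/\sigma)$. Hence, $m_\infty(w, \mathcal{F}, \mathcal{F}) = \tau(\sqrt{n}\,v/\sigma)$ where $v = \min\{\|f\| : f \in
\mathcal{F},\ \|f\|_\infty = w\}$. By the lemma on $\Omega_{\mathcal{F}}$ proved above, $v = w/(\sqrt{n}\,\Omega_{\mathcal{F}})$. It follows by Lemma 14 that

(125) $\sup_{f \in \mathcal{F}} P\{W > w\} \ge 1 - 2\alpha - \tau\left(\frac{w}{\sigma\,\Omega_{\mathcal{F}}}\right).$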
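Take $w^* = \sigma\,\Omega_{\mathcal{F}}\,\tau^{-1}(1 - 2\alpha - \gamma)$.

Part II. Case (a). $\epsilon_2 \le \epsilon_\infty/\sqrt{n}$. First, note that $m_\infty(w, \mathcal{F}, V) = \tau(w/\sigma)$
for $w \le \sqrt{n}\,\epsilon_2$, because the minimum two-norm for a given infinity-norm
is achieved on the coordinate axis. Second, let $A_0 = \mathcal{F}$ and $A_1 = V$ in
Lemma 14. Then, for $w \le \sqrt{n}\,\epsilon_2$,

(126) $\sup_{f \in \mathcal{F}} P\{W > w\} \ge 1 - 2\alpha - \tau\left(\frac{w}{\sigma}\right).$

Let $w_0 = \sigma \min(\tau^{-1}(1 - 2\alpha - \gamma),\ \epsilon_2\sqrt{n}/\sigma)$; then $\sup_{f \in \mathcal{F}} P\{W > w_0\} \ge \gamma$.

Case (b). $\epsilon_2 > \epsilon_\infty/\sqrt{n}$. First, note that $m_\infty(w, \mathcal{F}, V) = \tau(w/\sigma)$ for $w \le \epsilon_\infty$.
Second, let $A_0 = \mathcal{F}$ and $A_1 = V$ in Lemma 14. Then,
for $w \le \epsilon_\infty$,

(127) $\sup_{f \in \mathcal{F}} P\{W > w\} \ge 1 - 2\alpha - \tau\left(\frac{w}{\sigma}\right).$

Let $w_0 = \sigma \min(\tau^{-1}(1 - 2\alpha - \gamma),\ \epsilon_\infty/\sigma)$; then $\sup_{f \in \mathcal{F}} P\{W > w_0\} \ge \gamma$.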

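Part III. The argument here is based on an argument in Baraud (2004).
Let $\hat{f} = (U + L)/2$. Define a rejection region

(128) $\mathcal{R} = \{W > w\} \cup \{\|\hat{f} - \Pi\hat{f}\|_2 > W/2\}.$

Now, for any $f \in \mathcal{F}$, $f^* = f$, $\|\hat{f} - \Pi\hat{f}\|_2 \le \|\hat{f} - f\|_2$ and

(129) $P_f(\mathcal{R}) \le P_f\{W > w\} + P_f\{\|\hat{f} - \Pi\hat{f}\|_2 > W/2\}$

(130) $\le \gamma + P_f\{\|\hat{f} - \Pi\hat{f}\|_2 > W/2\}$

(131) $\le \gamma + P_f\{\|\hat{f} - f\|_2 > W/2\}$

(132) $= \gamma + P_f\{\|\hat{f} - f^*\|_2 > W/2\}$

(133) $\le \gamma + P_f\{\|\hat{f} - f^*\|_\infty > W/2\}$

(134) $\le \gamma + \alpha$

which bounds the type I error of $\mathcal{R}$.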

Now let / be such that ||/ — II/|| > max{w,e2}. Because ||/ — II/|| > 
||/ - n/||, ||/ - n/|| > e 2 implies that f* = f. And thus, 

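(135) $\|\hat{f} - \Pi\hat{f}\|_2 \ge \|f - \Pi\hat{f}\|_2 - \|\hat{f} - f\|_2 \ge w - \|\hat{f} - f\|_2.$

Hence,

(136) $P_f(\mathcal{R}^c) = P_f\{\|\hat{f} - \Pi\hat{f}\|_2 \le W/2,\ W/2 \le w/2\}$

(137) $\le P_f\{\|\hat{f} - \Pi\hat{f}\|_2 \le w/2,\ W \le w\}$

(138) $\le P_f\{\|\hat{f} - f\|_2 \ge w/2,\ W \le w\}$

(139) $\le P_f\{\|\hat{f} - f\|_2 \ge W/2\}$

(140) $= P_f\{\|\hat{f} - f^*\|_2 \ge W/2\}$

(141) $\le P_f\{\|\hat{f} - f^*\|_\infty \ge W/2\}$

(142) $\le \alpha.$

Thus, $\mathcal{R}$ defines a test for $H_0 : f \in \mathcal{F}$ with level $\alpha + \gamma$ whose power more
than a distance $\max\{w, \epsilon_2\}$ from $\mathcal{F}$ is at least $1 - \alpha$. Using Lemma 11 with
$\xi = \alpha + \gamma$ and $\delta = 1 - \gamma - 2\alpha$, this implies that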



(143) max{w,e2} > 2n(a,'y){n — d) 

The result follows. 



□ 



Proof of Theorem [181 The volume is minimized by making as 
large as possible and e 2 as small as possible. To achieve the lower bound 
on the width requires < w?r and e 2 > 2n{a^){n — d) 1//4 n~ 1//2 . □ 

Proof of Theorem 19. Let $A = \{T \le \chi^2_{\gamma, n-d}\}$. Then

$P_f\{f^* \notin B\} = P_f\{f^* \notin B, A\} + P_f\{f^* \notin B, A^c\}.$

We claim that $P_f\{f^* \notin B, A\} \le \alpha/2$ and $P_f\{f^* \notin B, A^c\} \le \alpha/2$. There are
four cases.

Case I. f G T. Then f = f* and P/{/ ^ B, A c } < F f {A c } < a/2. 
F f {f $B,A}< F f {f iB} = Pn/{n/ i B] < P n/ {||/ - Uf]^ > w T ) < 
a/2. 

Case II. feV-T where V = {/ : ||/ - n/|| < e 2 , ||/ - n/|U < ej. 
Again, / = /*. First, F f {f(/B,A C } < F f {\\Y - /IU > z a/2n ) < a/2. 

Next, we bound F f {f <£ B,A}. Note that / = liY ~ iV(#, cr 2 nn T ), where 
5 = IT/. Then £ ~ i\T(^i, O 2 ). Let £ = (L + e^, U - Then, Uf € B 
implies / G S and P/{^ 5,4} < P/{n/ $ B } < a/2. 
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Case III. f$V, ||/-n/|| < e 2 and H/-II/IU > eoo . In this case, f* = 
II/. ThenP/{/*,/ G B C ,A C } < F f {f G B C ,A C } < a/2. Also, P/{/*,/ G -B c ,^} < 

£ B} = P n/ {n/ ^ -B} < Pn/{||/ - n/IU > ^} < a/2. 
Case IV. f (/V and || / - n/|| > e 2 . In this case, /* = /. But 

F f {f $B, A} < F f {A} < F f ^uf,n-d(xln-d) < Fe 2 ,n-d(xln-d) < <*/2 
and 

P/{/ i B,A C } < F f {f i B,A C } < a/2. 

Thus, P/{/* B} < a. Equation Qjgft follows since P/{t < X^n-d} ^ 
1 - 7 for all / G T. □ 

Proof of Lemma [23j First note that if B is a ball in W 1 in any norm, 
then B — B = 2B. Second, we have that 

(144) uj{u) = su V {\Tg\: \\g\\ 2 <u, g£V-V} 

(145) = sup{|T 5 | : \\g\\ 2 < u, g G V{2e 2 , 2e OCl )}. 

To see the latter equality, note that if g,h G V, then we can write g — h = 
f + 5i — 5 2 where / G T and 5{ are in B^-(0, e^) for A; = 2, oo. Thus, <5i — 5 2 
isin2B^-(0 J e 2 )n2Bi(0,e oo ). 

Set £*(/) = 2e 2 ) D 2 eoo ). We have that 

(146) wfa^F) = sup{/i: ||/|| 2 < V , f G T} 

(147) w(»7,B*(0)) = sup{/i: [|/[| 2 < »7, / 6 B*(0)}. 

For any g G V(2e 2 ,2e oa ), we can write g = g\ + g 2 where g\ G T and 
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92 £ B* (0) and the two functions are orthogonal. Then, 



(148)w(u,V) = supiT(g) : 5 G V(2e 2 , 2 £oo ), \\g\\ 2 <u 



= sup < T(gi + g 2 ) : \\gih < V u 2 - c 2 , \\g 2 \\ 2 < c 2 , 

0<c<« L 



(149) 

(150) 
(151) 



< sup 

0<c<« 



sup 

0<c<u 



9l ef,g 2 G B*(0) 



sup T( 5 i)+ sup T{g 2 ) 

aier s 2 6-B*(o) 

_\\9l\\2<Vu 2 ~c2 Il32ll2<= 

w(Vt/ 2 -c 2 ,f)+w(c,_B*(0)) 



Moreover, equality can be attained for each c by choosing #i and g 2 to be 
the maximizers (or suitably close approximants thereof) of each term in the 
last equation. Consequently, 



(152) 



uj{u)= sup uj(Vu 2 - c 2 ,T) +lo(c, B*(0)). 

0<c<u 



To derive oj(t], B*(0)), note that / = ((77 A £2)^/™ A e^, 0, 0, . . . , 0) max- 
imizes fi subject to the norm constraint. Hence, uj(i], B* (0)) = min((?7 A 
e2)v / ^ £ oc)- For ufaf), let e = (1,0,..., 0) G M n . Recall that Qjr = 
i/m'mtt 76 ^! = ttt^, which is between and 1. Maximizing e T f for f £ J 7 

\\e\\ |ll^re| ||e|| ' ° J * 

and H/H2 < rj is equivalent to maximizing n(e, f) = n(H^e,f). The maxi- 



mum subject to the constraint occurs at /* = r/IIe/||IIe|| Hence, 00(1],^) = 
rj^/nQjr. Note that i] is in terms of the normalized two norm; in the "natural" 
(root sum of squares) norm, the modulus would be oj^(u, J 7 ) = uQ^. 
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It follows that 



(153) 
(154) 
(155) 

(156) 
(157) 



sup [uj{^v? -c 2 ,F) +w(c,fl*(0))] 

0<c<n 



sup 

0<c<« 



Vnf^v -u 2 — c 2 + min((c A e2)\/n, e c 



-v/n sup Vlj^y/ u 2 — c 2 + min(c, £2 A (eoo/^/n)) 
0<c<u L 



Q 2 



1 + ft 2 



+ min( 



Vi + ft 2 



,£2 A (eoo/Vn)) 



r VL 2 . ( Uy/H 

uyjn\l\ - --r + mm , 62V n ) e c 



1 + ft 2 



because the supremum over $c$ is attained at $c = u/\sqrt{1 + \Omega_{\mathcal{F}}^2}$. In the natural
two norm, we have



(158) u; k (u,V) 



o ^ 2 • ' " 

uS2i / — + mm 



ft 2 



1 + fi 2 



v 1 + n 2 



□ 
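
The maximization over $c$ used in the proof of Lemma 23 can be checked numerically in the case where the minimum is not binding, that is, where the objective reduces to $\Omega\sqrt{u^2 - c^2} + c$. Setting the derivative to zero gives $c = u/\sqrt{1 + \Omega^2}$ and the value $u\sqrt{1 + \Omega^2}$. The sketch below compares a grid search against this closed form; the function name and the parameter values are illustrative only.

```python
import math

def sup_over_c(u, omega, grid=100_000):
    """Grid search for sup over 0 <= c <= u of omega*sqrt(u^2 - c^2) + c."""
    best_val, best_c = -1.0, 0.0
    for k in range(grid + 1):
        c = u * k / grid
        # max(0, .) guards against tiny negative values from rounding at c = u.
        val = omega * math.sqrt(max(0.0, u * u - c * c)) + c
        if val > best_val:
            best_val, best_c = val, c
    return best_val, best_c

u, omega = 1.0, 0.5
value, c_hat = sup_over_c(u, omega)
closed_form_value = u * math.sqrt(1.0 + omega ** 2)
closed_form_c = u / math.sqrt(1.0 + omega ** 2)
print(value, closed_form_value, c_hat, closed_form_c)
```

When $\epsilon_2 \wedge (\epsilon_\infty/\sqrt{n})$ is smaller than $u/\sqrt{1 + \Omega^2}$ the minimum binds and this closed form no longer applies.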



Next, we prove the lower bound result generalized to a nested sequence
of subspaces. To do so, we need to prove several auxiliary lemmas. Define
for each $1 \le j \le m$,

(159) $U_j = \{f \in \mathbb{R}^n : F^*(f) = \{\Pi_j f, f\} \text{ or } F^*(f) = \{f\}\}.$

Referring to the definition of $V$ in equation (62), define here $V_j = V(\mathcal{F}_j, \epsilon_{2,j}, \epsilon_{\infty,j})$.
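Lemma 32. Let $w > 0$. Then,

(160) $m_\infty(w, \mathcal{F}_j \cap U_j, \mathcal{F}_j \cap U_j) = m_\infty(w, \mathcal{F}_j, \mathcal{F}_j)$

(161) $m_\infty(w, \mathcal{F}_j \cap U_j, V_j \cap U_j) = m_\infty(w, \mathcal{F}_j, V_j).$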



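Proof. First, let $f, g \in \mathcal{F}_j$ be the minimal pair for $m_\infty(w, \mathcal{F}_j, \mathcal{F}_j)$. Let
$\psi$ be a unit-2-norm vector in $\mathcal{F}_j \cap \mathcal{F}_{j-1}^\perp$. Let $\lambda > \epsilon_{2,1}$ and define

(162) $\tilde{f} = \lambda\psi + f$

(163) $\tilde{g} = \lambda\psi + g.$

Then, $\tilde{f}, \tilde{g} \in \mathcal{F}_j \cap U_j$ because if either $f$ or $g$ were in $\mathcal{F}_j \cap U_j^c$ then adding
$\lambda\psi$ makes the distance from the projection on one of the lower spaces larger
than the corresponding $\epsilon_2$. Also $d_{TV}(P_{\tilde{f}}, P_{\tilde{g}}) = d_{TV}(P_f, P_g)$ and $\|\tilde{f} - \tilde{g}\|_\infty =
\|f - g\|_\infty$. Hence, $m_\infty(w, \mathcal{F}_j \cap U_j, \mathcal{F}_j \cap U_j) \le m_\infty(w, \mathcal{F}_j, \mathcal{F}_j)$. But $\mathcal{F}_j \cap U_j \subset \mathcal{F}_j$,
so $m_\infty(w, \mathcal{F}_j \cap U_j, \mathcal{F}_j \cap U_j) = m_\infty(w, \mathcal{F}_j, \mathcal{F}_j)$ as was to be proved.

Second, let $f \in \mathcal{F}_j$ and $g \in V_j$ be the minimal pair for $m_\infty(w, \mathcal{F}_j, V_j)$. Now
apply the same argument.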

□ 

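Lemma 33. Let $0 < \delta < 1 - \xi$ and

(164) $\epsilon = \frac{(n - d_j)^{1/4}}{n^{1/2}} \left(2\log(1 + 4\delta^2)\right)^{1/4}.$

Define $A_j = U_j \cap \{f : \|f - \Pi_j f\| > \epsilon\}$. Then,

(165) $\beta = \inf_{\phi_\xi \in \Phi_\xi} \sup_{f \in A_j} P_f\{\phi_\xi = 0\} \ge 1 - \xi - \delta$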

(166) $ ? = i ^ : sup P/{<^ = 0} < Q 

I /e^ J 

is the set of level £ tests. 

Proof. Let Je be defined as in equation (|1UU|) in the proof of Lemma 
[TTl Let ^ be a unit vector in .Fj+i Pi Tj~ and let A > £2,1- Then, define 
Je = Xip + Je- Now apply the proof of Lemma [TTl using /o = A^ instead 
of 0. The total variation distances among corners of the hypercube do not 
change and the result follows. 

□ 
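
The constant in (164) is calibrated so that the leading term of the second moment bound (106), with $\tau \to 0$, equals $1 + 4\delta^2$. The sketch below checks this arithmetic. It also evaluates the standard inequality $d_{TV}(P_0, P_\mu) \le \frac{1}{2}\sqrt{E_0 (dP_\mu/dP_0)^2 - 1}$; that this is the bound used in the garbled display (101) is an assumption. The function names are hypothetical.

```python
import math

def separation_eps(n, d, delta):
    # Separation rate (164): (n-d)^{1/4} n^{-1/2} (2 log(1 + 4 delta^2))^{1/4}.
    return (n - d) ** 0.25 / math.sqrt(n) * (2.0 * math.log(1.0 + 4.0 * delta ** 2)) ** 0.25

def second_moment_bound(n, d, eps):
    # Leading term of (106) as tau -> 0: exp(n^2 eps^4 / (2 (n - d))).
    return math.exp(n ** 2 * eps ** 4 / (2.0 * (n - d)))

n, d, delta = 500, 20, 0.3
eps = separation_eps(n, d, delta)
bound = second_moment_bound(n, d, eps)
# ASSUMPTION: the total variation distance is controlled through
# d_TV <= 0.5 * sqrt(E_0 L^2 - 1), where L is the likelihood ratio.
tv_bound = 0.5 * math.sqrt(bound - 1.0)
print(eps, bound, tv_bound)
```

With this choice the total variation bound equals $\delta$ exactly, which is what yields $\beta \ge 1 - \xi - \delta$.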

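Lemma 34. Fix $0 < \alpha < 1$ and $0 < \gamma < 1 - 2\alpha$. Suppose that for bands
$B = (L, U)$

(167) $\inf_{f \in U_j} P_f\{F^*(f) \cap B \neq \emptyset\} \ge 1 - \alpha.$
Then

(168) $\inf_{f \in \mathcal{F}_j} P_f\{W \le w\} \ge 1 - \gamma$
implies

(169) $w \ge w(\mathcal{F}_j, \epsilon_{2,j}, \epsilon_{\infty,j}, n, d_j, \alpha, \gamma, \sigma),$
where the right hand side is the lower bound given in Theorem 17.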
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Proof. To prove this lemma, we will adapt the proof of Theorem 1171 as 
follows. By Lemma 32 the argument for Parts I and II is the same with $\mathcal{F}$
replaced with $\mathcal{F}_j \cap U_j$ and $V$ replaced with $V_j \cap U_j$. By replacing the reference
to Lemma 11 with Lemma 33 the argument for Part III also follows exactly.
The result follows.

□

Proof of Theorem 26. The result follows directly from Lemma 34 because
$\inf_{f \in \mathbb{R}^n} P\{F^*(f) \cap B \neq \emptyset\} \ge 1 - \alpha$ implies $\inf_{f \in U_j} P\{F^*(f) \cap B \neq \emptyset\} \ge
1 - \alpha$.

□ 

Proof of Theorem 28. Note that $P_f\{F^* \cap B = \emptyset\} = \sum_j P_f\{F^* \cap B = \emptyset,\ \hat{J} = j\}$.

We show that $P_f\{F^* \cap B = \emptyset,\ \hat{J} = j\} \le \alpha_j$ for each $j$. There are three cases.
Throughout the proof, we take $\sigma = 1$.
Case /. Wf-IljfW > e 2j -. Then, 

P / {f*nS = 0,J = i} < P / {j = i}<F / _n 3 /,n^ J (x^- dj ) 

— -^2,i ,n— dj (X 7) n-dj ) 
< ay 

due to £23). 

Case //. ||/ - IIj-ZH < e 2ii and ||/ - UjfW^ < e^j. So, 



f 



F*nB = <t>,J = j} < F f {f^B,J = j} 



< 



I /{ll/-/lloo>^+eoo J } 

< p / {||/-n i /|| 0O + ||n i /-n i F|| 0O >^.+e 0O , i } 

< p / {||n j /-n J -F|| 00 > w Tj ) 
= : ; 'u/{ ii,/ n,>- x -«•,•} 

< Oj. 

Case III. ||/ - Ujf\\ < 62j and ||/ - ILjfW^ > e^j. Now, 

P/{F*n£ = 0,J = j} < p / {n,/^ J B,j = j} 

= p / {||n i r-n J /|| 0O > Cj ,j = j} 

< P f {\\U j Y-U j f\\ 00 >c j } 

= Pn,/{||n i r-n J /|| 00 >c i } 



< CXj. 
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To prove ([771) . suppose that / G Fj. Then, Py| J > j j < 7. But, as long 
as J < j, W = w^(aj) + e^j < Wj(ctj) + eooj. The last statement follows 
since, when t<ij > Q(n — dj,a/2,j)(n — dj) 1//4 n -1 / 2 □ 

5. Discussion. We have shown that adaptive confidence bands for / 
are possible if coverage is replaced by surrogate coverage. Of course, there 
are many other ways one could define a surrogate. Here, we briefly outline 
a few possibilities. 

Wavelet expansions of the form 

$f(x) = \sum_j \alpha_j \phi_j(x) + \sum_j \sum_k \beta_{jk} \psi_{jk}(x)$

lend themselves quite naturally to the surrogate approach. For example, one 
can define 

j j k 

where $s(x) = \mathrm{sign}(x)(|x| - \lambda)_+$ is the usual soft-thresholding function.
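
As a small illustration, the soft-thresholding function $s(x) = \mathrm{sign}(x)(|x| - \lambda)_+$ can be sketched as follows. The coefficient values and the threshold are made up for the example, and the threshold is assumed to be nonnegative.

```python
def soft_threshold(x, lam):
    """Soft thresholding: s(x) = sign(x) * max(|x| - lam, 0), with lam >= 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# Hypothetical wavelet coefficients, shrunk toward zero with lam = 1.
coeffs = [3.0, -0.4, 1.2, -2.5, 0.1]
shrunk = [soft_threshold(c, 1.0) for c in coeffs]
print(shrunk)
```

Coefficients with magnitude at most $\lambda$ are set to zero and the remaining ones are pulled toward zero by $\lambda$, so a surrogate built this way keeps only the larger features of $f$.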

For kernel smoothers and local polynomial smoothers $\hat{f}_h$ that depend on
a bandwidth $h$, a possible surrogate is $f^* = E(\hat{f}_{h^*})$ where $h^*$ is the largest
bandwidth $h$ for which $\hat{f}_h$ passes a goodness of fit test with high probability.
In the spirit of Davies and Kovac (2001), one could take the test to be a test 
for randomness applied to the residuals. 

Motivated by ideas in Donoho (1988) we can define another surrogate as
follows. Let us switch to the problem of density estimation. Let $X_1, \ldots, X_n \sim
F$ for some distribution $F$. The goal is to define an appropriate surrogate band
for the density $f$. Define the smoothness functional $S(F) = \int (f''(x))^2\, dx$.
To make sure that $S(F)$ is well defined for all $F$ we borrow an idea from
Donoho (1988). Let $\Phi_h$ denote a Gaussian with standard deviation $h$ and
define $S(F) = \lim_{h \to 0} S(F \oplus \Phi_h)$ where $\oplus$ denotes convolution. Donoho shows
that $S$ is then a well-defined, convex, lower semicontinuous functional.

Let $F_n$ be the empirical distribution function and let $B = B(F_n, \epsilon_n) =
\{F : \|F - F_n\| \le \epsilon_n\}$ where $\|\cdot\|$ is the Kolmogorov-Smirnov distance and
$\epsilon_n$ is the $1 - \beta$ quantile of $\|U - U_n\|$ where $U$ is the uniform distribution and
$U_n$ is the empirical distribution from a sample from $U$. Thus, $B$ is a nonparametric, $1 - \beta$
confidence ball for $F$. The simplest $F \in B$ is the distribution that minimizes
$S(F)$ subject to $F \in B$. We define the surrogate $F^*$ to be the distribution
that minimizes $S(F)$ subject to $F$ belonging to $B_F$, where $B_F$ is a population
version of $B$. We might then think of $F^*$ as the simplest distribution that is
not empirically distinguishable from $F$. A natural definition of $B_F$ might
be $B_F = \{G : \|F - G\| \le \epsilon_n\}$. But this definition only makes sense for fixed
radius confidence sets. Another definition is $B_F = \{G : P_F\{G \in B\} \ge 1/2\}$.
To summarize, we define 

(170) $F^* = \mathrm{argmin}_{F \in B_F}\, S(F)$
where

(171) $B_F = \{G : P_F\{G \in B(F_n, \epsilon_n)\} \ge 1/2\}$

and $B(F_n, \epsilon_n) = \{G : \|F_n - G\| \le \epsilon_n\}$. Let

(172) T = U{G' t : GeB(F n ,e n )}. 
Then 

(173) £(x) = inf F'(x), u(x) = sup F'(x) 

defines a valid confidence band for the density of F*. 
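
The Kolmogorov-Smirnov ball $B(F_n, \epsilon_n)$ used above is straightforward to compute. The sketch below approximates $\epsilon_n$, the $1 - \beta$ quantile of $\|U - U_n\|$, by Monte Carlo and tests whether a candidate distribution $G$ lies in the ball. It is a minimal illustration, not a procedure from the paper: the function names, the number of replications and the seed are arbitrary, and the membership test assumes that $G$ is continuous and strictly increasing on the support of the data.

```python
import math
import random

def ks_distance_to_uniform(sample):
    """Exact sup_x |U_n(x) - x| for a sample of points in [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained just before or at one of the order statistics.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def ks_radius(n, beta, reps=2000, seed=0):
    """Monte Carlo approximation of the 1 - beta quantile of ||U - U_n||."""
    rng = random.Random(seed)
    dists = sorted(
        ks_distance_to_uniform([rng.random() for _ in range(n)])
        for _ in range(reps)
    )
    return dists[math.ceil((1.0 - beta) * reps) - 1]

def in_ball(sample, cdf, eps):
    # ASSUMPTION: cdf is continuous and strictly increasing, so that
    # ||F_n - G|| equals the KS distance of the transformed sample
    # G(X_1), ..., G(X_n) to the uniform distribution.
    return ks_distance_to_uniform([cdf(x) for x in sample]) <= eps

eps_n = ks_radius(n=50, beta=0.05)
print(eps_n)
print(in_ball([0.25, 0.75], lambda x: x, eps=0.3))
```

Finding the smoothest member of the ball, that is, minimizing $S(F)$ over $B$, is a separate optimization problem that this sketch does not attempt.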

Let us also mention average coverage (Wahba 1983; Cummins, Filloon,
Nychka 2001). Bands $(L, U)$ have average coverage if $P\{L(\xi) \le f(\xi) \le U(\xi)\} \ge
1 - \alpha$ where $\xi \sim \mathrm{Uniform}(0, 1)$. A way to combine average coverage with the surrogate
idea is to enforce something stronger than average coverage such as

$P_f\{L(\xi) \le f(\xi) \le U(\xi) \text{ and } \hat{f} \preceq f\} \ge 1 - \alpha$

where $\hat{f} = (L + U)/2$ and $\hat{f} \preceq f$ means that $\hat{f}$ is simpler than $f$ according
to a partial order $\preceq$; for example, $f \preceq g$ if $\int (f'')^2 \le \int (g'')^2$.
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